HTML can easily be offline-able. Base64 your images or use SVG, put your CSS in the HTML page, remove all 2-way data interaction, basically reduce HTML to the same performance as PDF and allow it to be downloaded.
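To make that concrete, here's a minimal sketch of such a self-contained page (the markup is illustrative and the image payload is truncated; none of it is taken from any site in this thread):

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <title>Self-contained page</title>
      <!-- CSS lives in the page itself; no external stylesheet request -->
      <style>
        body { max-width: 40em; margin: 2em auto; font-family: serif; }
      </style>
    </head>
    <body>
      <h1>One file, no sidecar directory</h1>
      <!-- raster image inlined as a base64 data URI (payload truncated for brevity) -->
      <img alt="a photo" src="data:image/png;base64,iVBORw0KGgo...">
      <!-- vector art inlined as SVG markup; again, no extra request -->
      <svg width="100" height="100" viewBox="0 0 100 100" role="img" aria-label="red circle">
        <circle cx="50" cy="50" r="40" fill="red"/>
      </svg>
      <p>Save this one file and it keeps working offline, like a PDF.</p>
    </body>
    </html>

Save-as on a page like this produces a single HTML file with no dependencies.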
* PDFs are files
HTML is files
* PDFs are decentralised
This should be "PDFs can be decentralised". PDFs aren't inherently any more decentralised than any other kind of file, including HTML.
The store is the thing that becomes decentralised, not the content.
* PDFs are page-oriented
HTML can be page-oriented. Simply build your website with pagination. PDFs can also be abused to have hugely long pages. Bad UX can be encapsulated in any medium.
* PDFs used to be large (bla bla bla Javascript weighs a lot)
Nope, PDFs are still objectively larger than the equivalent HTML. PDFs don't have any dynamic interaction; rip all that out of your HTML, produce the HTML of yesteryear, and it will be tiny in comparison to the PDF.
Edit: I'm sorry, the more I think about this the dumber I feel. The web is useful because it's 2-way. I am excited by the web because I can interact with other people. I come to hacker news to engage with thinkers, not to just read a published article from one single author. I want to read ad-hoc opinions and user submitted content. PDF web, really?
When you find a page - inherently a document-oriented term - like an article, blog post, how-to, or project writeup that's interesting or useful, and you want to make sure it's available to you later, what do you do?
Do you save the HTML, CSS, and Javascript, and hope that it works offline? I used to use the "Save page as..." tool back in the early 2000s, but it's become less and less useful, with too many dysfunctional disappointments.
No, I cut out some junk I don't need with the Printliminator [1] bookmarklet, then I do a *print-to-PDF.* This gives me a file. I can save the file, back it up to my NAS, search for it later, keep it with other files from a project where it was useful, and otherwise hang onto it. This is so common, in fact, that it's gone from being an obscure thing you could do with a Postscript-to-PDF converter or (before the adware/Ask toolbar scandal) by installing the CutePDF virtual printer, to a built-in feature. Modern OSes bundle a PDF printer, and print dialogs understand that you want to "Save as PDF". Google Docs and Office 365 editors allow downloading a document as a PDF.
I totally agree that a dynamic, interactive page or a comment section is not compatible with this model of usage. There's a lot of consumption of endless feeds, and a lot of one-time video views that also don't make sense to save as offline files. However, the web for creators, where people write articles that are worth hanging onto, has a definite place for PDFs.
> When you find a page [...] and you want to make sure it's available to you later, what do you do?
Instead of doing a bad and lossy job of archiving the page myself, I notify† our friendly neighbourhood archivists at the Internet Archive of the page; and they then do the best, most lossless job of preserving the page that they're able, given their cumulative experience.
As a side-benefit, they also then take care of keeping the archive they've made around and available online in perpetuity, with no additional marginal effort on my part. The same can't be said for something in my own "private collection."
This may not be well-known, but archive.org can and does remove pages / sites from the archive. Authors can request this, site owners (separate from the authors) can request this. There may be others who can request this.
Just an FYI. If there are critical sites you want copies of, I'd recommend making your own copy. I've lost access to important pages / sites twice before taking this to heart.
There is value in having a personally curated, offline collection of documents. You can search, annotate or otherwise manipulate it to your heart's content, all without having to be connected.
Of course the Internet Archive serves other purposes for which it is (currently) irreplaceable.
Hopefully it really is around a very long time, but the world is unpredictable and things change. It's great to enhance the Internet Archive, but you can bet I'm keeping my local copy too. Just in case.
That's suboptimal as well. The site could come out with a new robots.txt file which is just
    User-agent: *
    Disallow: /
and everything already indexed by the Internet Archive is now inaccessible to you.
I don't think I've ever had such a thing that only appeared as a web page, without being emailed to me. To me, the email is the primary-source document in that arrangement.
This is still not as powerful as my one, simple trick to handle all bookmarks, ever: Print to PDF.
I've been doing it since last century, and I have tens of thousands of PDFs of every single web page I've ever found interesting, sitting right there in a directory on my computer.
——
Including the suggestion that was brought up to use ripgrep to search the PDFs' text content.
Sometimes, if I'm researching a topic, I'll dig up a large number of newspaper articles and want to print them and read them away from the screen while scribbling notes etc., but on a lot of websites banner ads or footers with copyright statements can really mess that up.
I actually dislike HTML per se, but the only two benefits I see for PDFs in the general case are:
- In my experience, it's a little harder and rarer to make PDFs utterly incompatible with different means of viewing them, and it generally requires more overt (if perhaps slightly unintentional, at times) sadism to make that happen.
- PDFs can do some things HTML can't (easily, at least) with document design -- though those things are generally things that would be disallowed in our new "deurbanized" PDF-based web replacement.
Everything else that comes to mind goes the other way, including the fact that the viewing-mechanism incompatibility thing can be even worse with PDFs, even if it's rarer for that to happen at present; and if PDFs became the new standard for the web, I'm pretty sure that relative rarity would evaporate anyway. Let's also not forget that HTML can do some things PDFs can't (as easily, at least).
> Do you save the HTML, CSS, and Javascript, and hope that it works offline? I used to use the "Save page as..." tool back in the early 2000s, but it's become less and less useful, with too many dysfunctional disappointments.
I'm too lazy, so I just tend to use SingleFile these days...
You got nerd sniped by the HTML vs. PDF format thing and missed the entire point of TA:
> Isn’t it a good thing that we enjoy rapid progress? To the extent that we get to enjoy things like YouTube and sandspiel, yes! But to the extent that we want the internet to be a place where we can work and live and think and communicate free of malware, surveillance, dark patterns and the insidious influence of advertising, the answer is, empirically, sadly, no. The web has become ad-corrupted hand-in-hand with growth in technological capability, and the symbiotic relationship between web and browser means they feed on each others’ churn. Ads demand new sources of novelty to put themselves on, so the web expands continually, the specs grow in complexity, the browsers grow in sophistication, the barrier to entry grows ever higher, the vast cost of it all demands more ad revenue to fund it... and thus the perpetual motion machine is complete.
The author does identify a problem, and so you want to focus on that. That's fine. There is the issue of triviality, however.
The problem described is widely felt, and also widely discussed. We already know this stuff to be a problem. For the piece to be worthwhile, then, it should do something that is not present in the other instances where the topic has been raised. It should articulate (or at the very least exhibit, without necessarily articulating) a solution for us. It doesn't. A bad remedy to a genuine problem does not yield a solved problem.
The article is called "Deurbanising the Web", and its thesis is:
- Publish in static file formats.
- Date and hash your work.
- Stop spying on your users.
HN is a discussion forum, not project planning software. Not everything has to "yield a solved problem". Are you really setting the bar at "design a technology stack for replacing HTML/CSS/JS"? That's way, way too high.
You say that its thesis is (in part) to generally publish in static file formats, but that's not quite accurate. The piece specifically touts PDF/A as the best format and makes several arguments against the use of html/css. I agree that they're making a broader point than just "use pdf," but "use pdf" is definitely a large part of it.
Those points can be trivially met with static HTML and something like IPFS, and you can still download HTML for local storage and viewing. You can even print to PDF if you really want to do so. Meanwhile, PDFs also allow dynamic files, don't require dating and hashing, and can be used to spy on users or deliver malware.
EDIT: Oh, yeah, and static file formats doesn't necessarily have to mean static document formatting when viewing -- unless you're using PDFs, which tends to break useful stuff like reflowing for paginated documents (one of the worst things about even simple PDFs).
No, the entire point of the article is to convince people to use PDF/A. Which I find comical, since you have to go out of your way to check whether a PDF is PDF/A compliant. If the web were run by PDFs, there's no reason big corporations would abide by those rules, and it'd be just as messy as HTML is today.
You've also been nerd sniped. TA goes on and on about surveillance capitalism and the attention economy. Weird, for an article that's supposedly convincing engineers of the merits of one file format over another.
Did you read beyond the "How did it come to this?" section? TA goes on and on about web standards and the need for PDF/A.
Edit: If the article _was_ all about surveillance capitalism, then it wouldn't be worth upvoting as actionable solutions are much more valuable than preaching to the choir.
If you don't think it's clear that the author's advocacy of PDF is a means to an end, subservient to their desire to dismantle surveillance capitalism and the duopoly that Google/Apple have on the web, I don't know where to go from here.
It's when you trick a technically-minded person into jumping down a rabbit hole of a technical problem/controversy. Here it's PDF vs. HTML, but other classic nerd snipes are UTF-8 vs. anything else, "fixing" election tech, etc.
But, again, the premise is not that "as a file format, PDF is better than HTML". The premise is: because HTML is two-way, it enables surveillance capitalism and allows bad actors to monopolize the attention economy. The author wrote it thus:
> Sure, you can write good HTML. I won’t argue with that. And if you’re writing good HTML, good for you. But HTML is a dual-use technology, the bad guys are dual-using it an awful lot, and I feel that the stone age still has a part to play in the progression of the information age.
The part where you engage with this is where you write:
> I'm sorry, the more I think about this the dumber I feel. The web is useful because it's 2-way. I am excited by the web because I can interact with other people. I come to hacker news to engage with thinkers, not to just read a published article from one single author. I want to read ad-hoc opinions and user submitted content. PDF web, really?
Which is interesting! Do you have thoughts on creating peer-to-peer systems that don't enable surveillance capitalism?
A key here is that it's easier to write good HTML docs than good PDF docs, and much harder to deal with the harmful aspects of PDF docs given present technology.
> Which is interesting! Do you have thoughts on creating peer-to-peer systems that don't enable surveillance capitalism?
I don't know about the other person's ideas, but decentralization plus better anonymization and pseudonymization, with always-on strongest-reasonably-possible encryption, seems like the direction to go.
> A key here is that it's easier to write good HTML docs than good PDF docs, and much harder to deal with the harmful aspects of PDF docs given present technology.
Oh, yeah I'm not on the PDF train. That's wild. I'm more of a Markdown or Gemtext advocate, or even LaTeX.
> I don't know about the other person's ideas, but decentralization plus better anonymization and pseudonymization, with always-on strongest-reasonably-possible encryption, seems like the direction to go.
Yeah, projects like IPFS (which you reference above) are working towards this, but JavaScript still works over IPFS. Plus, fingerprinting techniques are pretty bonkers. Most of it comes down to JS and various state you keep on your local machine (cookies, flash cookies, etc.), but I think you need that. How do you maintain a session with a peer without some kind of token/cookie?
> Do you have thoughts on creating peer-to-peer systems that don't enable surveillance capitalism?
Yes, it's called Tor. However, legislation is where we should start. Crippling/abandoning an incredibly useful technology which works very well just because it's often used nefariously seems to be a bit of an overreaction.
Until then, stop using social platforms, use an ad blocker, and use a VPN if you really care about "surveillance capitalism".
Saying HTML can be offlineable is like saying C can be provably terminating. There's a subset of programs where that's true, but it's not inherent to the form. A PDF is inherently self-contained; standard web technologies are not. When you open a page and it's a PDF, it gives you certain guarantees; when you open it and it's HTML, you have to do further investigation.
Firstly, C being provably terminating is a problem dealing with the full body of C programs written in the world. The OP is dealing with their own self-published content. That's a different problem: if your analogy held it would need to be limited to proving that a subset of C programs written by the author terminate.
Secondly, the level of difficulty in making HTML offlineable is many orders of magnitude simpler than your C analogy: there's really no comparison. For the OP we only need to make HTML documents that they have authored themselves offlineable and yet people have written general purpose tools to do this automatically for most webpages. This is not a hard problem.
This is a helpful post because it gets to the heart of the difference. Many people are saying "if you do HTML in a particular way, you get the same benefits." I'm asking "what's inherent to the form?" That's exactly the point about C--you can write it in a way that's provably terminating, but it's not guaranteed. Consider the consumer's perspective.
When I land on a page that's a PDF, I know certain things--I can easily save it and read it later. How do I know that? Not because I have read the PDF spec, or know that much about it, but because of my experience as a consumer of the web.
When I land on an arbitrary web-page, do I know the same thing? No. I don't know what the page is doing, I don't know what my browser will do when I try to save the page. When I save this page, I have the option to save HTML only, or a complete web page. Will the complete page actually work? I go into the source, and there's a link to the javascript (which is saved locally). Does rendering the page rely on that javascript? Does that javascript do xhr or fetch calls? Since it's Hacker News, I suspect the answer is no. However that's not inherent to the medium.
There are better ways to archive the content of even dynamic JS heavy pages, but they are not things that you learn as an average user of the web.
It's possible to write PDFs that don't "work" (for some useful definition of "work" similar to the case with HTML) offline. Please stop pretending that's not true.
The reason offline utility tends to be true more often for PDFs is that PDFs are not generally regarded as the preferred online-default format of choice, which is in turn a matter of social effects rather than technical capacity. Reverse the socially accepted roles of the two document formats and watch the same complaints get made against PDFs as you're making against HTML. I'd bet money the "normal" state of affairs would remain the same in terms of the perceived benefit/detriment allocation between online/offline formats; only which format was considered which would have changed.
. . . but then all the web would be even heavier documents, and even less customizable for local viewing, thanks in part to that pagination and strict formatting situation.
It's possible, but it takes work. I can't remember the last time a pdf did something unreadably weird, usually my only gripe is with something that's a scan of an old document but whoever turned it into PDF didn't do OCR.
I don't really follow. How does this author converting their entire site to PDF help readers/visitors/users?
The original HTML site[0] was printable as PDF, and save-able as both HTML and "Web page, complete", all of which result in a well-formatted & readable offline experience. (It was also responsive: very readable on mobile, but that's an aside).
The new PDF site is not accessible to some, difficult to read on mobile, and interacts poorly with all of the norms web users are accustomed to (back navigation, anchors, etc.)
It's the difference between "this thing has X property" (termination or able to save for offline reading) and "this thing _obviously_ has X property, in a way that you can tell without any expertise, or doing any investigation".
How important this is to users, or whether it is worth it is something I've not commented on, but it is a difference.
hyperpage's analogy would work if the property was "avoids undefined behaviour", rather than "avoids nontermination". When we encounter a webpage, we are being expected to execute potentially complex, well-being threatening code whose behaviour is about as easy to predict as obfuscated C.
> When you open the page and it's a PDF, it gives you certain guarantees ….
I think that this is a lot less true than we're used to thinking. The PDF spec contains a lot more interactive capabilities than I think most people realise. (It supports JavaScript!) We're not used to seeing those capabilities abused, because there's no point; it is so much easier to abuse HTML. But, if people want to abuse PDF—and, if we somehow convinced the world to move to it, then they would—then they easily can.
(I'm not conversant enough in the spec to know, but I do know that Postscript is Turing complete, and I don't know that PDF isn't. At least HTML on its own certainly isn't—no recursion!—although all bets go out the window once you start layering other tech on top of it.)
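For the curious, document-level scripts hang off the PDF catalog object. A hand-written fragment (illustrative only, not from any real document) looks roughly like this:

    1 0 obj                       % the document catalog
    << /Type /Catalog
       /Pages 2 0 R
       /OpenAction << /S /JavaScript
                      /JS (app.alert("Hello from a 'static' document");) >>
    >>
    endobj

Any viewer that implements the JavaScript action will run that on open, which is exactly the kind of capability that would get abused if PDF became the default web format.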
I don't buy that the problem with the web is that HTML is not inherently offlineable. HTML may not be inherently offlineable but it can be. PDF isn't inherently a web friendly format, but it can be. There really isn't any good argument for PDFing the web.
Even that usually sucks nowadays, because web developers don't care anymore. Probably 75% of the time before I do that, I have to go into the dev console to delete overlay elements that obscure content and garbage that will waste 10 pages (e.g. grossly oversized images, related article recommendations, etc.).
There was a time when most websites had a print view that gave you a simplified html page that worked well, but I think most of those are gone now. Now it's all some print "media-type" CSS that no one ever put the time in to do properly or keep up to date.
I agree; I don't see why anyone would call publishing in PDF "dumb". The author of the material gets to choose his medium. If "you" don't like it, then move along or convert it to your preferred format. In other words, "why not both?"
> A PDF is inherently self-contained, standard web technologies are not
What technologies exactly?
You can have absolutely everything you need inside the HTML. You can inline CSS, JS, SVG and images. What technologies can't you inline?
You are correct that you CAN - but who does? That's no longer considered best practice. The argument these days is that it's a lot easier to manage CSS if it's in a separate file, same with JS, etc. So none of the serious web developers actually do anything inline anymore. The time it would take to convert a "best practice" website with separate files for HTML, CSS, JS, etc. is just not worth it. The point he's making is still valid - why not have the option for something static?
But with the same (and even greater) success, you can declare "I'm switching to self-contained HTML! No more external resources!" instead of "I'm switching to PDF, saying farewell to interactivity and mobile devices".
It's just the declaration of ONE person, switching ONE site.
> HTML can easily be offline-able. Base64 your images or use SVG, put your CSS in the HTML page, remove all 2-way data interaction, basically reduce HTML to the same performance as PDF and allow it to be downloaded.
You're missing the point. Even a relatively computer-illiterate person can easily save a PDF to their hard drive, and it's significantly more difficult with HTML. At a minimum you're probably going to get an HTML file with a sidecar directory (or, I believe, a sometimes browser-specific archive; it's been a long time since I tried, since it works so poorly), and even that may not have the content you want due to dynamic sites.
As I explained, if the author wants to make HTML easily offlineable, then inline CSS and Base64 images. Or, you know, make your website printable. If authors actually thought about the print-to-PDF "problem", it could be solved with traditional CSS and HTML. As someone else said, we used to do this. It used to be part of my everyday web design job to make sure the page printed nicely.
The idea that the whole web is going to pander to edge case archivers is asinine. This whole conversation is about supporting the needs of the very, very few and romanticizing about the time when only interesting people used the internet. It's kind of elitist and self serving.
I guess I don’t really understand the point being made. Does it matter that much that saving a page creates a single file on your hard drive? If you really want a static rendering of a site, why not just print it to a PDF? Why does that have to dictate the file format you use for distribution? With PDFs you don’t have to worry about conversion, but they are also comparatively larger over the wire.
> even that may not have the content you want due to dynamic sites
But PDFs also don’t give you dynamic content. Nothing is stopping people from using HTML to serve static, JS-less content. In fact that’s what it was originally designed to do. All this web app stuff was bolted on afterwards, and it’s optional.
What do we accomplish by having some people switch over to PDFs? The people who don’t care about bloat will continue to not care about it. It’s not like thin content will become more discoverable or more common. It doesn’t really change incentives. The author says using PDFs makes it so you’re not tempted to add cruft to your sites but that’s not really a compelling argument.
Getting content creators to produce content without bloat is not really a technical problem. It’s a cultural and economic one. I don’t see how a file format addresses that.
> Does it matter that much that the artifact of saving a page be a single file in your hard drive?
Yes, it matters a lot. Word/Excel files are actually a zip archive containing many files and sub-directories. Can you imagine people working with exploded Word files, sending over mail and WhatsApp complete directory trees?
The file format restricts the possibilties. You know what to expect when you see a PDF - static, JS-less content. With HTML on the other hand, it depends on what the author decided.
Or I could just make sure that my page prints reasonably well (we used to do this) and use the print-to-pdf functionality available in modern browsers.
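For anyone who never did this, here's a rough sketch of the kind of print stylesheet we used to ship (the selectors are illustrative placeholders, not from any real site):

    <style>
      @media print {
        /* hide navigation, ads and other chrome that wastes paper */
        nav, footer, aside, .banner-ad, .cookie-popup { display: none; }
        /* plain black-on-white serif paginates nicely */
        body { font: 11pt/1.4 serif; color: #000; background: #fff; }
        /* keep link targets legible on paper */
        a[href]::after { content: " (" attr(href) ")"; }
      }
    </style>

With something like that in place, the browser's print-to-PDF output looks like a document instead of a screenshot of an app.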
But if you want a page in PDF, you can print it to PDF. Sure, non-computer-savvy users might not know how to do it off-the-bat, but browsers make it pretty easy.
Oh, I know that. I just meant that if your goal is for the website to be easily archivable, rather than publishing the website as PDF you could use simple HTML which wouldn't suck when printed to PDF.
Say hello to your new sidecar directory (or broken CSS/images/God knows what else)!
I tried to save an NY Times article, and it 1) needed JS to display anything, 2) even with the sidecar stuff was broken, 3) it was so plastered with ads and other junk I thought it was incomplete (it wasn't, I just had to scroll waaay down past something that looked like a footer and some voids after that).
If you save a PDF, you get that exact PDF on your hard drive, and when you open it (even in 10 years) it will look exactly the same as it did on the site.
This is of course the point of the article - that the web is a giant steaming pile of shit for the most part, plagued by JS and external resource requirements, all of which contribute to massive total page size.
I'll preface by saying I have some expertise in HTML, but none in PDF (the format).
Most commenters who suggest that HTML is still a better alternative to PDF (I agree) are assuming that, if this is an important issue to you, you would craft your page in a simpler style than most of what we see on the web, making Print to PDF or Save As... more viable.
> PDFs and a PDF tool ecosystem exist today. No need for another ghost town GitHub repo with a promising README and v0.1 in progress.
This is news to me. I'm not sure that I buy it. PDFs have always been a pain in the ass to work with in my opinion. Maybe there are tools, but in my experience they aren't very good.
In general, we know that HTML is going to be much more compact (and compressible!) than PDF, and that's the biggest advantage I see on a web where bandwidth still matters. Another downside shows itself when trying to copy and paste the above quote: PDF formatting seems to be weird.
> In general, we know that HTML is going to be much more compact (and compressible!) than PDF, and that's the biggest advantage I see on a web where bandwidth still matters.
PDFs can be tiny if they do not embed fonts. Serving fonts is very much a complex technology in HTML world.
Browsing the web is a pain in the ass if you don't use a browser compliant with up-to-date standards, but the whole "HTML can be lightweight" argument pretty much depends on avoiding much of today's standardisation. As an objection to the original argument, it is not comparing like with like.
> This is news to me. I'm not sure that I buy it. PDFs have always been a pain in the ass to work with in my opinion. Maybe there are tools, but in my experience they aren't very good.
> In general, we know that HTML is going to be much more compact (and compressible!) than PDF, and that's the biggest advantage I see on a web where bandwidth still matters. Another downside shows itself when trying to copy and paste the above quote: PDF formatting seems to be weird.
PDF is a display format. I once worked on a project parallel to a guy who was parsing PDF to extract text content. IIRC, text in PDFs is stored in a way that works fine for printing/rendering but not so well for manipulation (e.g. it's a bunch of commands to render line Z at position X,Y with font W). Those commands don't have to be in reading order, nor do they have the semantic meaning you can get from markup like HTML (e.g. superscript can be nothing more than a different line rendered with a smaller font).
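To illustrate, here is roughly what the text-drawing operators inside a PDF content stream look like (a hand-written sketch, not an excerpt from a real file):

    BT                    % begin text object
    /F1 12 Tf             % select font resource F1 at 12 pt
    72 700 Td             % move to x=72, y=700, measured from the page's bottom-left
    (Hello, wor) Tj       % paint a run of glyphs...
    (ld) Tj               % ...which can be split anywhere, with no semantic hint
    ET                    % end text object

An extractor has to reassemble reading order and structure from coordinates like these, which is why PDF-to-text is so fragile.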
IMHO, PDF is actually less optimal than HTML for what this guy is advocating, except that it's precisely those limitations that have prevented PDF from becoming the mess that Web HTML has. Though that's probably in large part because the bloaters have been too distracted by the easier target that is HTML to bother.
I actually did this pretty recently, in an attempt to get some magazine articles onto my Kobo e-book reader since Pocket couldn’t fetch the paywalled ones (I do pay).
I figured I could just save the page, automate a few edits to get around dynamic stuff, and then use it as, you know, an HTML document.
Even with a nice friendly mostly-text literary magazine, after about five hours I gave up and just copy-pasted the rendered text.
HN is not a good site to illustrate the unpleasantnesses of navigating the modern web. As you'd hope for a hacker news site, it is very friendly to this sort of thing. Most sites aren't.
> You're missing the point. Even a relatively computer-illiterate person can easily save a PDF to their hard drive, and it's significantly more difficult with HTML. At a minimum you're probably going to get an HTML file with a sidecar directory (or, I believe, a sometimes browser-specific archive; it's been a long time since I tried, since it works so poorly), and even that may not have the content you want due to dynamic sites.
Ctrl+P -> Save as PDF
You don't need the page to be a PDF to save it as a PDF.
The guy outlines his whole case based on those exact points which are, as you have observed, technical quibbles and not a basis for abandoning HTML.
Under the hood it seems apparent to me that the real premise is an emotional one, not a technical one.
The internet is plastic not because of HTML, but because of money and people. When you have teens driving content it's going to feel plastic. When Walmart uses the internet to sell you crap it's gonna be plastic. Gossip / social platforms are trash, no matter the medium.
It could be argued that TV is an incredible learning platform ruined by HD. Back in the standard definition days we had proper news, documentaries that were substantial, and no reality TV. We need to go back to black and white standard definition.
Sorry, but the PDF web is not a solution to societal rot.
> The guy outlines his whole case based on those exact points which are, as you have observed, technical quibbles and not a basis for abandoning HTML.
His case is actually more of a social observation: it doesn't matter what the technology can do; what matters is how the developers of that technology actually use it.
People who use PDF almost never use 3D graphics and heavy dynamic JS, so PDFs almost always have many of the qualities he's seeking.
Web developers almost never inline anything, and do all kinds of things that are arguably deal-breakers except for a few lowest-common-denominator use cases.
> Under the hood it seems apparent to me that the real premise is an emotional one, not a technical one.
The premise is that the web has failed in important and clear ways, that it's impossible to fix so we should give up, that many use cases should abandon it for something else, and that PDFs are unexpectedly well suited for that.
On a related note, part of me wishes Java Applets never died. Getting rid of them seems to have caused the Web to turn into them, and maybe if they'd remained some kind of separation could have been maintained.
Turning PDFs into the replacement for HTML would change the incentives around PDF authoring, and PDFs would then acquire the same problems identified with HTML.
The solution to the identified problems is not to switch to PDFs. Stop reshuffling the chairs on the deck of your sinking ship, and start figuring out how to design, implement, and incentivize the use of, some means of conveyance other than iceberg-vulnerable ships.
> On a related note, part of me wishes Java Applets never died. Getting rid of them seems to have caused the Web to turn into them, and maybe if they'd remained some kind of separation could have been maintained.
Not so surprising, really: the PDF standard evolved in parallel with Adobe's Flash between 2005 and 2010, which was then the key technology in Adobe's effort to keep a strategic toehold on the web. If Flash had not been a security clusterfuck, it might still be around. The PDF standard was always meant to be a complementary standard, and Adobe's attempted successor technologies have followed an even closer technological path.
The PDF standard has benefited from the fact that, unlike the W3C and WHATWG, surveillance capitalists have not been in the driving seat of its standardisation effort. Adobe's interests are not identical to those of the public, but they are not as essentially adversarial to them as the web standards bodies have been.
I'm not exactly sure what point you're trying to make here, but I don't think two different formats for encoding formatted text with images constitute different "mediums".
Of course they are, and we run into it constantly in computing. You can encode text with images as a bitmap, as vector graphics, as symbolic content that references bitmaps or vectors, as an algorithm that procedurally generates any of the above...
While you can produce identical outputs from the different methods, it's not hair-splitting to say that the authoring process and hence the nature of the medium to shape expression is affected by choosing one. When you opt towards maximizing generality your production cycle can grow without bound because everything is possible by layering different media, even if all of it is unnecessary. That's how you end up with creative projects that take multiple years to decades to accomplish.
Well, you seem to get the gist of the hot take the author put out. This article is not about PDFs. There is something wrong with the world and we can sense it.
This is close to it:
When you have teens driving content it's going to feel plastic.
Youth is the ultimate quality destroyer. They just fucking suck. I’m quite sick of their drivel honestly, and yet, we let them dictate the world (watch my childish cartoons, even in old age).
And the little shits complicate code bases. All you little rascals under 30, scram, I’m on to you.
And all you little adults acting like children, with your stupid motivational posts on LinkedIn, and your garbage bragging on there, I see you too.
Unless I'm on a paper-sized tablet I would definitely rather have an offline HTML file than a PDF. Nobody likes to pan back and forth on lines of text to read something.
I had the exact opposite reaction. I’m reading this on an iPhone SE2020, and I MUCH appreciate reading this in pdf form. I didn’t have to pan back and forth or even put the phone in landscape orientation. This is one of the smallest smartphones you can still buy, and the experience of PDF is WAY better than the user-hostile auto-flow text forced down mobile users’ throats.
I was skeptical at first, but I think the author made the point fantastically well.
Your browser has a zoom functionality that lets you make the text smaller, essentially replicating the PDF site above. Only the opposite of what you say is correct: I can’t read that PDF’s text without turning my phone to landscape and picking up my glasses.
To get equally small text on my desktop I have to turn the font size all the way down to 7. God forbid you have readers with less than stellar eyesight.
I get what they're going for but the PDF is not exactly an accessible reading experience.
I’m commenting here as a user reading a PDF. The fact that someone else could have laid it out differently doesn’t change the fixed layout of the PDF that I’m trying to read.
There’s a reason responsive design has been a big deal for the last 10+ years and I don’t think the benefits of PDF are worth throwing it out.
> These all seem like technical quibbles that miss the point.
If these all "miss the point", what is the point?
It seems to me that the article's point is that PDF as a format has attributes that satisfy the author's goal, whereas HTML does not. The parent comment says that HTML does have those attributes after all (if you choose to use HTML that way). That is very directly addressing the article's point, as I understand it.
Perhaps I misunderstood, but I believe the author's point was to highlight what a steaming mess the modern web is. The PDF aspect strikes me as illustrating a point, not a seriously proposed solution.
Yes, and many of these things are "in general" not well supported by anything but Adobe's own PDF software.
Even the most simple interactive things can easily fail to work correctly, even in the more widespread PDF readers.
IMHO PDF is in many ways worse than HTML; it's just that those ways are less commonly used. But if you start a PDF-instead-of-HTML trend, it's just a matter of time until these "not so compatible" aspects of PDF become widely used by some people.
Just caveating a technical statement I knew wasn't quite true, not making any sort of assessment either way.
As someone who has had to extract data from large sets of PDFs and modern web presentation formats, I'm not a fan of either, really. Even verifying that a visibly presented string exists in a PDF document programmatically can be a non-trivial task, as with a given website as well. That to me says a lot.
monkeynotes seems to take the line that technical defects in claims others make fatally undermine their case, but technical defects in his/her arguments are irrelevancies.
For what it's worth, the same objection occurred to me. The use of scripting I've seen in PDFs has been use-supporting and consistent with their book-like feel.
Also - how are PDFs exactly "discoverable"? I have petabytes of PDFs and making them easily "discoverable" for any mass use, such as analytics, search, or data analysis is a massive pain. I'd rather have them in a non-PDF format.
Not a single researcher or data analyst I know of would prefer "discoverable" content to be in PDF format, regardless of just how awesome the OCR is (which it often isn't, especially for tabular data). Even for all-text, non-tabular documents, OCR does not provide the metadata needed to make sense of the documents. Why PDF is claimed to have superior "discoverability" in the OP essay is a mystery to me. For the sake of "discoverability", PDF is definitely not the way to go.
Honestly, if you're going to put out a manifesto as a PDF, at least take some time "layouting" your design. The one advantage of that format is that you control the aspect ratio. Every font is permissible, everything is absolutely positioned. Using a generator to create it is cringey. Show the art that's possible. Really sell the format.
FWIW I deliver PDFs daily as an art director; not ideal, but they work in most cases. There's certainly nothing rebellious or non-commercial about them.
> HTML can easily be offline-able. Base64 your images or use SVG, put your CSS in the HTML page, remove all 2-way data interaction, basically reduce HTML to the same performance as PDF and allow it to be downloaded.
I built a tool for this exact purpose[0] since the HTML specification and modern browsers have a lot of nice features for creating and reading documents compared to PDF (reflow and responsive page scaling, accessibility, easily sharable, a lot of styling options that are easy to use, ability for the user to easily modify the document or change the style, integration with existing web technologies, etc.). In general I would rather read an HTML document than the PDF document since I like to modify the styling in various ways (dark theme extensions in the browser for example) which may be hard to do with a PDF, but its more of a personal preference. Some people will prefer that the document adjusts to the screen size of the device (many HTML pages), and others will prefer the exact same or similar rendering regardless of the screen size (PDF).
Either way, it's kind of a fun idea, making a website using just PDFs. Not the most practical choice, but fun nonetheless.
This reminds me of the guy who said Dropbox was stupid because he could set up an FTP server. It's the exact same argument.
People understand PDFs, they are extremely common in the academic and business world as “digital paper” standalone documents. Hypothetically, anything in memory can be made into a file but in this scenario what matters is the practical goal of people actually using these files.
I think it makes sense for the web to be made up of discrete primitives, not only so that the web can be browsed in an intuitive and frictionless way, but also because it lends itself to being backed up and easily re-hosted.
This. Also who hates the huge double margins? The slow rendering? The unnatural break-up of text? Meaningless headers and footers? And the whole page-based layout? PDF is not meant for the web. Period.
All true. Incidentally, I do not see pagination as necessary or in most cases even desirable; rather, I see it as a vestige of printing technology, even as the need for printing has shrunk dramatically over the past 20 years.
What I like best about pdf files is that I can just give them to someone and be almost certain that any questions will be about the content rather than the format of the file.
Sure - if the publisher cares. From the user's standpoint, the safe assumption is that they don't. Of course PDF is No Good for many contexts, but for any sort of long-form document that is primarily meant to be read, it's so often better.
Also, if something is available in pdf, I can be moderately sure that someone else took the time to make sure it would be formatted correctly and print out OK.* If it only exists in HTML it's more of a roulette wheel experience.
* Unless some graphic designer thought 'gee this report would look so cool if the cover pages were black or some other highly saturated block of solid color.'
HTML used to be a very nice format in the age of XHTML 1.1: very formally specified, and a tie with the DOM was assured by the very strictly standardised DOM v3. And ACID3 gave you pixel-for-pixel repeatability during rendering.
HTML+JS today... now it's effectively a standard in name only, and Chrome is the new IE6. The standard is now "what has worked in the last stable release"
Now go to http://acid3.acidtests.org/ and see how the latest stable Chrome release can't render a decade-old CSS testcase.
> Base64 your images […], put your CSS in the HTML page
Is there a tool that does those two things (or at least the first one) and that can be used by non-programmers (command line use is fine, a Python library would not be)?
"I come to hacker news to engage with thinkers, not just read a published article from a single author."
And how many websites today are anything like HN in terms of relative simplicity, e.g., no images^1, 3rd-party requests or ads, and only a tiny bit of (gratuitous)^2 JS?
1. I do not participate in the voting scheme but I could vote from the command line if I wanted to. I use a text-only browser so the grey, fading text gimmick is irrelevant. I see all comments and treat them according to the thinking, not the voting.
2. If we exclude the .ico and a .gif
There seems to be a double-standard, for lack of a better term, where many HN commenters and voters appear to work for companies that make websites with tracking and ads and various gimmicks targeted at "non-thinkers" which are nothing at all like HN. Whatever these commenters and voters see and appreciate in HN they are not working to bring it to the rest of the web. I seriously doubt they comment and vote on HN out of fear of so-called "power users" or a belief that the HN type of simplicity could become more popular and threaten their jobs that depend on surveillance, online ads and a non-thinking audience of "powerless" users. Rather, a more rational explanation might be that they see some value in a website that shows no ads and generally uses no gimmicks; that's something to think about.
"PDF web" may not make sense to many folks who have invested heavily in JS and Big Tech web browsers, but Postscript is arguably more elegant than Javascript. "Thinkers" usually like FORTH.
In a sea of cynicism, I gotta say.. bravo. This genuinely put a smile on my face. It has a lot of problems, sure, but it's a creative use of the Web and would surely work for some use cases. It's certainly no worse than using Flash ever was.
It reminds me a bit of a "newsletter" I'm subscribed to called, ironically, "Not a Newsletter" (http://notanewsletter.com/). You get an email from the author each month and it just points to a Google Doc where he puts the actual content. Why's this good? The content can't set off any spam filters, and he can edit the issue after it's "sent" if there are mistakes or broken links...
The content can be censored arbitrarily by Google, and when you click on the mobile web with the Docs app installed, it logs your logged-in Google account identity (maybe your work one?) with the view when it switches to the app.
If the author was concerned about getting censored by Google or feeding their data empire, they could set up a self-hosted Google Docs alternative, like NextCloud.
The readers would still need to trust the author's not doing anything nefarious with their IP addresses, but I guess there's a degree of implicit trust when subscribing to a newsletter.
I would just put it on my own server. Are people really worried about clicking a private link and having their IP address logged? Just opening an email with a tracking pixel triggers that already, and you have to assume clicking a link will log your IP whether with Google or Constant Contact or any other mass email provider.
Google Docs are still files. It's just up to the author (or even the readers) to keep copies outside of Google's servers. Unless Lab6 owns their own servers, whoever is hosting these pdfs can delete them as well. At least, in both cases, static files are much easier to backup and copy than entire three-tier dynamic applications. And readers can keep their own copies separate from the original, which isn't possible with an application at all.
Yup. Another way to say it is Google will release a file format the day offline computing drops dead. It should probably amount to an antitrust case or at least a major class action claim at this point. That said, even with PDF specs it's freakin impossible to read/write that format in an intelligible way, if the person creating the document used even the barest amount of block alignment. Adobe started with an innovative notion about layout, but ended up making content extremely hard to parse, and actually tried to open source the engine. Google started with an idea of trapping everyone's data in a format they'd never make fully available, and then charging for the privilege of storing it.
My eyes are not very good. I have trouble reading the font in the PDF. I am using Firefox. HTML lets me pick a font that I can read easily. I cannot do that with PDF.
> PDFs used to be unreadable on small screens, but now you can reflow them.
I am using Firefox. I cannot do that.
Realistically, how many years will I have to wait until Firefox catches up?
Over twenty years ago, I learnt Web authoring by examining the source, which had a profound effect on my career. That serendipitous opportunity I had with human-readable sources will be lost to the next generation with PDF - they have to learn the technology deliberately.
My understanding is that PDF is a monster of a document format, and it's clearly not (usually and historically) meant to be reflowed. Even copy/pasting from PDFs can be very disconcerting because the viewer may not have a good idea of where blocks of text start and end (or even what the characters really are).
I can empathize with the feeling that the web is incredibly bloated, but that's IMO throwing the baby out with the bathwater. Simple HTML with some optional CSS would do the job much better IMO (and can be easily downloaded, mirrored or offlined with tools like wget).
And if you really don't like writing HTML (I won't blame you) then there's always formats like markdown, org-mode and friends which can easily be converted to pretty much anything.
Dealing with PDFs (as in, coding a system that can import/export/display them) is more obnoxious than dealing with excel spreadsheets.
Unless your system is a PDF library (as in, you make the black-box dependency that other systems use to handle PDF exports), everything you do with PDFs will be through some annoying black-box dependency that is a pain to use.
Even relatively complex HTML is much more fun to work with than PDF.
Brief investigation suggests reflow is a super-clumsy, ultra-coarse-grained view mode that is implemented by few clients, is not easy to access, is not well known, and is vastly inferior to what you can get on the web, especially as it’s basically text-only.
In Adobe Acrobat (and I’m guessing Adobe Reader): Choose View → Zoom → Reflow, and it turns everything into one column of nigh-unformatted text.
(Word looks like it may support it, but that could be more that it’s converted it to a Word document in some way and reflow-like functionality falls out of that naturally, though I imagine the tagging would help with the conversion; and someone in this thread mentions something called “Book Reader” supporting it.)
Source code for websites hasn't been readable for years. Reading a minified JS document that has mauled the DOM is only slightly more readable than the structure of a PDF.
TBH it's a little bit like complaining you can't open a modern binary executable in a hex editor and learn programming from that. Days of doing your regular coding by writing direct machine code or assembly are (mostly) gone, and for the sake of advancing the craft, I'm (mostly) happy with it.
But I too wish the modern web was simpler. It took an evolutionary path of maintaining just enough backwards compatibility to only keep making things worse. Efforts like Gemini[1] bring some hope but I'm afraid the medium won't be flexible enough for much beyond personal blogs. But maybe that's for the better.
Gemini is as "terminal-only" as Markdown. Just because it's a text format first and foremost, does not mean that you can't display it nicely formatted. It's more like EPUB in that regard.
Gemini sites are not terminal-only and the renderer can make it look beautiful (depending upon one's definition of beautiful). One example is Lagrange:
I read this entire document. If you've ever had to write a PDF-to-text parser - and God help you, I have - you will beg for Flash to come back as a web standard.
[edit] Generally though, I'm sympathetic with your point and it's kind of like why zines regained popularity in the 90s (and samizdat in the Soviet Union before that)... controlling your own publishing is a powerful idea. Anyone can do that though, without resorting to obscure formats, unless obfuscation is the point.
Yeah, 10 second load time, tiny text on a mobile device. No thanks. Sucks that people went for over-styling every site making everything painful to publish. I’d be happy with 90’s static HTML, and a few images when needed. I seek information, not “an experience”.
I had no idea what the content of the site was (besides the title from HN) and around the 50% download point, I had already lost interest. I'm clearly not the only one who loses interest this quick [0][1][2].
Also, as others have mentioned in root level comments, the design & layout of the content within is also severely lacking, which makes waiting for the load to occur even less worth it.
Exactly this. It is, by the way, one of the main reasons I initially stuck with HN. The lean UI, text-based simplicity, efficiently conveying information had me instantly. I would sacrifice styling for speed anytime, everywhere.
On the contrary, I much prefer a small text on a mobile device to the reflowed text on a mobile device that we’re always forced to use. The PDF is also the same view as on a desktop, so if I look at it on another device, my spatial memory of where stuff is remains intact.
Might as well just generate a PNG. The text is too small for me on a mobile device. PDF's main goal was print. The fonts are awful for the screen and there's no ability to reflow the text.
I can deal with things moving around, I don't need spatial memory for that. Just give good titles, headers, and indexes. Again, we can do this with simple HTML, embed images and styles. It's all there.
Unfortunately, as I mentioned, people don't really publish information anymore. It's mainly for "experience" and for "looks". Marketing, and advertising, now drive the information era. The "Information Super Highway" is now just a crumbling road plastered with billboards. Most content is useless, and is there for clicks. Heck, I'd rather someone post their site in digests in e-book formats than PDF.
I just ran your PDF through an accessibility checker and it failed magnificently. For this reason alone, suggesting people make more use of PDFs instead of well-formatted HTML is a total non-starter for me (and should be for everyone).
Heck, even PDFs produced by Word (or comparable FOSS editors) are so much better (except if you've done it incorrectly by "printing" it) than this particular one.
I find it quite amusing that the author is railing against HTML at least in part because it's practically impossible to build a new web browser at this point, and then moves to PDF instead.
In my time working with PDFs, I've found that generating them in ways that can be read with the most popular PDF readers is cryptic and difficult, and even parsing the ones made from the most popular creators is hard.
I would definitely not pick PDF over HTML in regards to how easy it is to implement a good reader or writer.
And there's plenty of authoring tools for HTML already, so the "ecosystem already exists for PDF" doesn't track either.
Even the complaint about churn makes no sense to me, because there's no need to upgrade your tools constantly. If you're using something that produces good HTML today, it'll produce good HTML in a decade, too.
OTOH, if you have a problem that could be automated, you're a lot more likely to be able to create that tool for HTML than PDF, and it's quite likely that someone else already has for HTML, but not PDF.
> In my time working with PDFs, I've found that generating them in ways that can be read with the most popular PDF readers is cryptic and difficult, and even parsing the ones made from the most popular creators is hard.
Both pdf readers on my phone can't read the pdf, so this is definitely an issue.
As someone who works with PDFs a lot, please don't. PDFs are awful in every case except those which require a very precise visual layout. From reading the article, I do not see a single case in which PDF is superior to vanilla HTML.
My kids school used to send links to google docs for their announcements, I hated it. I pretty much hate any system like that, it's purely extra steps on the web.
In both email, and the browser I'm already in a program that displays text and images and cool stuff. So then I'm just sent a link to someplace else that does the same thing?
So then what? Is it all just "pdf can do that too", but with extra steps...? I can print to PDF in most browsers if I want, but in this case it isn't a choice.
The idea that I might save and store the school emails or that website and somehow manage those files seems kinda self-important in a way... I don't mean that as a personal attack, just that they imagine me taking the time to do that with their content? When otherwise it could have just been an accessible web page? How many people care to do that?
If I'm visiting a website I'm almost certainly not interested in saving your content / managing it... almost never.
I'm a little lost on the whole 'page-oriented' idea too. That's just a limitation of paper, and it's a pain / disruptive more often than not. Even the 'page oriented' section is broken up by the page and some extra text at the bottom of the page that is irrelevant to the paragraph...
If folks want a 'save to pdf' option might be nice to add, or the user can just print to pdf...
I certainly get the argument, but using something like hugo or gatsby or jekyll when you want to avoid the "churn" also seems like a perfectly valid solution.
The author addresses this pretty well. Because you can embed whatever you want, static site generators aren't really static. In particular, Jekyll blogs and what not still pretty commonly include comment sections.
Of course, pdfs aren't necessarily static, either, but that is why Lab6 is choosing to use pdf/a, an actually static format intended specifically for long-term archiving of immutable files. This way you can sign the file and guarantee it stays the same forever and everyone's copy is identical.
I'm kind of surprised at the response to this. The author seems well aware of how terrible pdf is as a format, and this isn't some treatise on why we should want to use it. It's an unfortunate compromise: given the requirements they're aiming to meet (generating a file that supports rich formatting and hyperlink embedding, but which can guarantee immutability and long-term archiving directly in the spec), pdf/a is all there is. So in spite of being a terrible format with a lot of shortcomings, it's what they're using.
Why don't they just use a static subset of HTML? You don't have to include comments sections, just like you don't have to include 3D CAD models and videos in your PDFs (yes you can do both of those, in theory anyway).
> The author addresses this pretty well. Because you can embed whatever you want, static site generators aren't really static. In particular, Jekyll blogs and what not still pretty commonly include comment sections.
But just like you can choose to use PDF/A, you can also choose to have a completely static and self-contained (e.g. using data URLs for images) HTML page.
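For instance, a minimal sketch of such a page; everything it needs travels inside the one file (the embedded image here is a 1x1 transparent GIF, included purely as an illustrative placeholder):

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <title>Static, self-contained article</title>
      <!-- CSS inlined: no external stylesheet request -->
      <style>body { max-width: 40em; margin: 0 auto; font-family: serif; }</style>
    </head>
    <body>
      <h1>An article</h1>
      <p>Save this file and it keeps working offline, unchanged.</p>
      <!-- Image embedded as a data URL instead of fetched from a server -->
      <img alt="placeholder pixel"
           src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7">
    </body>
    </html>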
Nobody is requiring you to use PDF/A. No mainline browser (that I'm aware of) requires it.
So what is being solved? When I click on a PDF on the web, I don't know if it's using PDF/A, I don't know if it's embedding or linking its fonts. So it's the same situation, nothing has changed.
Telling people to use PDF/A when most clients do not enforce it and when there's no indication to users before they click on a link whether or not the link is following the spec -- it is exactly the same as telling them to use a subset of HTML; the author is doing the same thing they complain about.
You can't just say that PDF/A exists. That's not enough, how will you get people to restrict themselves to that format when 99% of their users will never notice the difference and no client is enforcing it?
The only thing I like about PDF compared to HTML is that with PDF, I know for a fact that no web requests are made in the background. That means no fingerprinting, no analytics etc.
With HTML, I have to trust that some random entity does what they state in their privacy policy, and they regularly don't. Sure, I can disable JS, but then 95% of the web doesn't work anymore.
Other than that, PDF is quite clearly a less accessible format.
That's not the PDF spec, is it? That is a spec for Adobe Acrobat, which is not allowed to make any web requests thanks to my application firewall (Little Snitch).
Pretty sure a PDF opened in the browser can't run any JS, but not completely sure. So you're right: I don't really know it for a fact. Poor choice of words.
The spec is ISO 32000, and it’s expensive and closed, so difficult to reference. But according to Wikipedia at least, JavaScript is normative in it. No idea if SOAP / Web Services is part of it though.
Are you sure? I was under the impression that PDFs can reference web resources, and this is why there are more stringent standards for archiving (PDF/A and friends)
> With HTML, I have to trust that some random entity does what they state in their privacy policy, and they regularly don't. Sure, I can disable JS, but then 95% of the web doesn't work anymore.
If you only allow PDF, then 99.9999% of the web doesn't work anymore.
I'm all for getting sites to be static, but PDF doesn't fix that because the problem has never been the technology used to build the site.
When I click a link you mean? Definitely true, but that way they only have access to my IP and user agent, which is still better than all the WebGL, Font library, display calibration settings, mouse movement etc. that they use otherwise.
I often use Tor, although I'm pretty sure that even then, a good analytics lib can see it's me based on scroll behaviour, mouse movement, time of day, and of course what I browse.
Very surprised to see just a few comments mentioning EPUB, which IMO is much more suitable for a document-centric approach. It's an open standard with a freely available[1] specification, and I've never had any problems with EPUBs on PCs, tablets, and phones.
Not only simple browser plugins per the other reply (and a plethora of non-crashing mobile apps, whereas mobile PDF reading apps crash on me all the time) - the ePub format is just a zip file in disguise with plain text (HTML) inside and maybe some images/etc.
In a manner of speaking, ePub as a design has an inherent built-in fallback mechanism to manually obtain the internal content in case of failure - including ability to try and repair a broken zip format (zip -F/-FF) and grep it in place (zipgrep).
I also enjoyed the sentiment of the article. I used to blog a lot but in the last decade I have preferred more long form writing. Now I use the leanpub.com [1] service so when I write, I get generated PDF/ePub/Kindle formats, and material is readable online as HTML/CSS. For me leanpub is a way to make content free and accessible, but people can pay if they want. The relatively few people who pay for my material have a large effect on what I decide to write about in the future or which writing projects to drop.
I consume the web mostly by following a few very interesting people on social media and following their links. As an author, my goal is to keep producing interesting enough material to be worth people's time reading.
As others have pointed out it's strictly worse than a static HTML site in many, many ways. At the same time though, it's a brilliant criticism of many of the worst aspects of the modern web.
Great article - so much depth and accuracy to this! I see a lot of discussion about the semantics of pdfs but I think those are missing the overarching theme here.
Feels like this is more about the fact that websites have become increasingly dynamic, unstable, unreliable, inconsistent, etc. - pdfs offer something like a book, static, stable, reliable and consistent.
Think about a book you can turn to a specific page no matter how many times you look at it and the print is the same, the information is the same, you can do the same action over and over again and get the same expected result.
Now imagine opening a book and you could have sworn that the chapter you wanted to reference was 11 but now it's 16 and the images are different, the examples are different, in fact the quote that you wanted to use for reference no longer exists in the book.
There's an insanity to this experience, but it's exactly what the web is like: a book that is constantly changing, being upended, even disappearing entirely. I could have sworn I had bought that book on discrete mathematics; how could it be gone? Oh, that's right, the server hosting the site is powered off. The book no longer even exists.
This is true, but I do think a PDF is just conceptually simpler and requires less technical knowledge. Especially in a situation where technical users are scarce.
IMO most people have a mental model of a PDF as being a digital document, whereas a HTML file is somewhat more amorphous.
I use a terminal pager with PDFs quite frequently. It works surprisingly well. Even something you wouldn't expect, like a pay stub, renders fine in the terminal.
> PDFs are universally understood by most people and can be read on phones, desktops, laptops, and eBook readers.
PDFs need a proprietary app to use, most of which are loaded with spyware & trackers. I may be mistaken in this but MacOS/iOS are the only OSes I know of that read them natively? There's absolutely nothing universal about the format.
HTML is truly universal: not only does every OS come with a built in HTML viewer, but it's a plain text file. You can read the source using anything.
> Once you’ve downloaded a local PDF version of the site, there is no risk that it can be changed or removed by the host.
Once you've downloaded a local HTML version of the page, there's no risk that it can be changed or removed by the host. Yes, there are caveats to both: people can create PDFs with remote embeds or HTML sites with AJAX content, but both of these are the fault/responsibility of the individual author. It's as easy to make good downloadable HTML as downloadable PDF.
The so called "churn" is the responsibility of the individual HTML author. If you're making bad HTML, the fix is to start making good HTML. Not to switch to a closed inaccessible format.
PDF is an open format, with multiple FOSS reader implementations. You could argue that a subset of niche features can only be used in Acrobat Reader, but AR is far from the only PDF reader out there.
And the churn is part of the zeitgeist, not really a responsibility of anyone in particular. Individuals are suckered into it, companies are supplying it, and governments are allowing it. We're all part of it. Not new either: I'm hearing it since the 90s how the modern life is rushed, and that's just my limited experience.
I said it wasn't universal, which is somewhat different to the vague idea of being "open", and yes, PDF is technically an "open format" depending on how you define "open". The ISO 32000 spec. costs in the region of ~200 USD/EUR.
What that "openness" translates into in the real world is that there are zero non-Adobe viewers that support all of PDF's features, and even less PDF editors. The standard PDF editor costs ~200 USD/EUR (annual subscription).
This is before we even get into the nightmarish world of PDF parsing. Or PDF accessibility.
PDF is a great format if you're sending a document to someone for them to print immediately. It has no other valid uses imo.
I have done it with a couple of PHP libraries (fpdf and mpdf), but they are primitive, compared to desktop PDF generators. I know that you can use Java (never done that), or even...ugh...XSL (also never done that).
Most desktop operating systems offer print-to-PDF functionality. It was long an add-on on Windows, but that's really a historical accident / deliberate choice of that platform.
PDFs can be trivially created from Markdown or using LaTeX templates if you're looking for a programmatic solution. Pandoc and XeLaTeX are helpful, the Poppler libraries as well. Again, these are generally and widely available at no charge.
I set up my blog so that the page source would consist of the original markdown and as little markup as possible to make that render. You can read it with telnet and the experience isn't so much worse than using a browser.
(The actual part that makes this work is a pile of opaque javascript doing all sorts of nasty things at runtime, but such is the way of web pages in today's browsers, I don't worry too much about it).
In all browsers that I use that is only true if the server sends a Content-Disposition header with its value set to “attachment” (optionally with a file name), or maybe also in the case where the server specifies incorrect or unspecific Content-Type (such as simply “application/octet-stream” instead of “application/pdf”).
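For example, a response that forces the download prompt carries headers along these lines (the file name is illustrative):

    Content-Type: application/pdf
    Content-Disposition: attachment; filename="issue-0.pdf"

Drop the Content-Disposition header (or set it to "inline") and a browser with a built-in viewer will render the PDF in the tab instead.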
Maybe the author doesn't realize how difficult PDF is to work with. In PDF it's ambiguous whether any two spans of text belong together in the same sentence or paragraph. It can even be unclear where the spaces between words are. PDF also allows "optimizing" font usage in ways that make text unreadable without OCR-ing the custom font. The messy hacks go on and on.
OTOH it's totally possible to make a self-contained HTML page without using a JS framework of the day. It's going to be way easier to consume than a PDF.
I do realize how ugly PDFs are to work with (I wrote my own PDF/A generator for issue 2[2]). This is a Tagged PDF though, so you can extract text using standard tools.
To understand the mindset, have a read of the Gemini FAQ[0], specifically the answer to why not use a subset of HTML - and then read Issue 2[2] which is a hybrid Gemini+PDF polyglot, for people who don't like reading PDFs, which is apparently everyone on this thread :)
Issue 1[1] also moves beyond PDF, to try to address some of the accessibility shortcomings by (a) prepending the content as plain text, and (b) recording myself reading the whole thing out and arranging the file as a polyglot MP3 and PDF file that can be played in an audio player as well as viewed in a PDF reader as well as a text editor.
A mini-FAQ to address some points elsewhere in the thread:
* No, it's not going to replace your blog or the web in general.
* Yes, it's an experimental art project / longitudinal CTF forensics tournament / weirdo personal blog.
> The problem is that deciding upon a strictly limited subset of HTTP and HTML, slapping a label on it and calling it a day would do almost nothing to create a clearly demarcated space where people can go to consume only that kind of content in only that kind of way. It's impossible to know in advance whether what's on the other side of a https:// URL will be within the subset or outside it. It's very tedious to verify that a website claiming to use only the subset actually does, as many of the features we want to avoid are invisible (but not harmless!) to the user
But I don't really know that your PDF website doesn't use some evil invisible PDF feature.
And I have to use a special Gemini browser to access Gemini pages. (Since an HTTPS bridge misses the point)
So why not use Dillo as my "Sane subset of HTML"? It is not hard to hand-write HTML that looks great in Lynx, Dillo, and Firefox.
> It is not hard to hand-write HTML that looks great in Lynx, Dillo, and Firefox.
Actually, it is. I love Dillo, but it's very limited. I like to make my images "fluid" using the max-width and max-height CSS properties, and Dillo will not support those in any foreseeable future.
> would do almost nothing to create a clearly demarcated space
How do you create that demarcated space where PDF/A, PDF 2.0, and all other PDF versions can be mingled together, and there's no easy way to distinguish them?
I don't like reading PDFs and probably wouldn't read much of your website like that... but I appreciate the intervention drawing our attention to the advantages of PDFs in the disadvantaged present environment, which I think are real and worth thinking about. It seems almost like an artistic project. I'm not mad at you, and am not sure what makes some people seem to be so mad here (probably means you were successful at something)... but I'm still not gonna read it, PDFs are a mess to read!
I've spent entirely too much time "printing" sites and articles to PDF to save them to read or reference later. Your PDF style was perfect! No need to fuss with anything, just save it!
I think the idea of PDFs opens up many new possibilities, and your work is quite an eye opener. Design is largely missing from websites - it’s the same design over and over when it comes to optimizing for clicks.
Designers would thrive in a PDF environment instead of handing their designs over to implementation as it is now.
Maybe PDF is just the beginning, and maybe a similar format can be thought up that addresses some of the concerns expressed here, so we can move over in time.
Case in point: copy-pasting a paragraph from his PDF website adds line breaks everywhere. It also loses formatting (bold/italics), and the footnote superscript doesn't translate:
PDF is an open standard, which is freely available2, and stable. It has a
version number and many interoperable implementations including
free and open source readers and editors.
I think ease of copy-pasting is one of the coolest things about the document-centric roots of the web (along with the back button and hyperlinks; in other words, hypertext rules), although the modern web does break it (along with the back button and hyperlinks) in many places, so I can see where he is coming from. PDFs aren't the answer, though.
> OTOH it's totally possible to make a self-contained HTML page without using a JS framework of the day.
I'm basically in agreement, but the author has a good point that PDF is obviously self-contained and self-contained HTML pages are not necessarily distinguishable from those that aren't. Perhaps we might have to revisit MHTML or embrace Web bundles as an alternative to PDF.
On the other hand, there's nothing stopping you from using a double-barrelled file extension for denoting this sort of thing, e.g. "memex-opus.pub.html"; so long as it ends with something recognizable, double-clicking should still open it in the browser across all the usual platforms, AFAIK.
(I'm fond of using "xyzzy.app.htm" myself to take advantage of this trick for distributing simple, self-contained programs that are designed to run in the browser.)
Wait, why?!? When does it render? Who's supposed to have a js engine to do that? What version? How does it load dependencies? Is HTML and DOM carried along with it? So many questions.
Why? Because scripting is useful. A big use of PDFs is translating paper forms into digital forms without needing to make a web app out of them. JS is used for client-side validation, the same reason it was put into browsers. Acrobat can handle this along with many other features that most PDF readers can't handle properly.
Basically in the PDF world, Acrobat Reader is Chrome and everything else is, like, Konqueror or something. Don't be fooled into thinking PDF is a small spec. It's not.
Dependencies? Hah, no such luck. You're stuck with ES5 and Adobe's crufty JS library. There is no HTML and DOM, there are however some pretty thorough PDF document bindings.
> it's totally possible to make a self-contained HTML page without using a JS framework of the day. It's going to be way easier to consume than a PDF.
Completely agree. For instance, NASA's APOD site[1] is a good example of something that'd be nontrivial using both an offline PDF and modern lightweight alternatives like Gemini, but works really well even without fancy modern design. Under 300kB including the image (HTML's under 6 kB) before gzipping.
The author addresses this: “We choose to switch to PDF in this decade, not because it easy, but because it is hard”
– John F. Warnock, September 12th 1962
The author is obviously making a statement, exploring ideas... not searching for an actual solution to his use case.
Yeah, it's kinda embarrassing that the one quote that gets pulled out in the HN commentary is the one that contains a typo. It's OK: Issue 1[0] contains a patch to fix the issue.
Please somebody bake an icon into the browser that turns green when websites are lightweight and content-only and make it affect Google rankings.
We don’t need PDF sites, we need incentives for publishing acceptable websites.
Side note: I’d honestly love for the government to step in and outright outlaw some obvious and intentional dark patterns (example: California unsubscribe law)
Is that actually an internal Google goal? If so, dear god, no wonder they are so willing to sacrifice the long term health of the internet in return for short term hypergrowth. No company Google's size can grow that fast without some serious dark patterns and user abuse.
You don't end up with that level of growth year over year for 20 years straight by accident. It is an unwritten assumption that missing 20% growth is a fail. I worked at Google almost 10 years and watched the dog and pony show (aka TGIF) from the inside. The real story is on the quarterly financial reports.
I've been doing something similar for 4 years now. I converted my niche website into a monthly magazine, that is released as a PDF (and also uploaded to Issuu).
It has its good sides and bad sides. People will download the PDF every month when there is a new issue, but you don't know if they read it, how much time they spend on it, etc. You won't appear in Google results as you would if you posted the articles as HTML, etc.
Based on my experience, I just keep doing it as an experiment and because I enjoy saying I run a digital magazine, but the truth is that there are no real advantages to it.
I find this to be a super interesting response. When I settled into my current website design, I ended up basically writing an article for the homepage. I'm not a designer by any stretch, and it was the most attractive homepage I could make, and I still really like it. I used a very similar workflow (and continue to for articles) to the papers I wrote in college, and would really only take one more step to get that to final pdf state.
I'm torn between leaning into the static nature of the site and implementing the wiki I've been thinking about making.
We already have a wildly popular website where all the main content is in the form of PDFs. It’s https://arxiv.org/. PDF is what you use when your document needs to have a predictable layout. This is especially important if it contains math, complex tables, or any elements where meaning is carried by positioning on the page. This can include aesthetic meaning, as in some forms of poetry that need to be laid out in a particular way.
There are several which at least strongly resemble that remark.
Project Gutenberg and the Internet Archive's text archives (along with numerous other document-oriented sites, several of the samizdat variety) offer content in PDF and other document-oriented, offline, downloadable formats.
Wikipedia has a "save to PDF" link on each article (that seems to work through the browser's capabilities, if any, not all browsers support this). The sister Mediawiki site Wikisource offers ePub downloads.
For longer-form content, PDF, DJVU, and a handful of other formats (arguably ePub) are at least reasonably popular.
> PDFs used to be unreadable on small screens, but now you can reflowthem.
(Pasted verbatim, retaining the missing space.)
I don't see this feature in Firefox's viewer, or the default Android one. Can anyone recommend a FOSS PDF viewer that has it? (It must be FOSS, otherwise the point about using PDF to avoid tracking is lost.)
Book Reader can reflow PDFs. It is very simple, which I like. But it adds any PDF you open to the library when you open the app, which I find only slightly annoying for non-books.
I found "PDFs are files" kind of compelling. Perhaps this was a flaw of the original www concept. Web pages were always technically files & documents, but this was always abstracted away from userland. "Save webpage" was never a core feature. This did disempower users.
PDFs are downloaded, saved, emailed around. They can also be linked to. Userland maintains a closer relationship with what's going on. A typical user knows that you can have a copy of a file, which may or may not be identical to the online one. The WWW, from its initial version, was mysterious. The transition from the model of requesting files from a server by clicking a link, to a programmatically generated stream of code executed in your browser, happened below the typical user's perspective.
The web has obviously gained a lot, but has also lost something.
I've definitely used saved webpages a lot. When we had dialup email only, my dad would drive to the library with a flash drive and download Web pages to bring home and read. It was great. Of course, it's even greater now that I can load it fresh even faster.
This was a great read. I'm sympathetic! I've had a website (Wordpress) for almost 10 years, but have stopped adding stuff to it lately, because I'm sick of the formatting changing on pages! I look again at a page that used to look great, now the vertical spacing is wrong, or tables have gone out of shape, or the font has changed to something awful. Maybe it's wordpress, maybe it's my bad css/html skills, maybe something else, not sure. I picked up LaTeX skills about 5 years ago and have just been making lovely PDF books of everything I'm into. And they stay just the way I made them. Kind of a shame though, no-one else gets to see them. Yet.
PDF is not a web format, and you're wasting effort trying to shoehorn print content and a print format into display on the web. Just use HTML and don't update it; it's probably easier.
It's pretty amazing that the basic HTML that I learned 20 years ago still works - it even displays fine on devices like tablets and phones that did not even exist 20 years ago. I understand the author's sentiment but PDF is an overreaction. Just write static boring HTML.
> it even displays fine on devices like tablets and phones that did not even exist 20 years ago
It would display perfectly if mobile browsers didn't have broken defaults (to work around broken websites) that you need to disable using <meta name="viewport" content="width=device-width, initial-scale=1">.
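To make that concrete, here is a minimal sketch of the kind of boring static page being described, with the viewport fix applied; nothing else is required for it to render sensibly on phones:

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <!-- Opt out of the legacy 980px-desktop emulation on mobile -->
      <meta name="viewport" content="width=device-width, initial-scale=1">
      <title>Plain static page</title>
    </head>
    <body>
      <h1>Still readable two decades on</h1>
      <p>No scripts, no frameworks, no build step.</p>
    </body>
    </html>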
Indeed, there's a lot of irony packed into the first page:
Featured is a quote from LWN indicting the "software industry" and its "brittle dependencies". What's ironic about this? It's squarely about the parts of the software industry that deal in things that are _not_ meant to be painted in the browser.
If you want a solution to the (perceived) churn, it's funnily enough right in the quote from Mark Pilgrim: "I've migrated to HTML 4". HTML is almost certainly not going to end up drifting in such a way that DJB's qhasm bibliography page[1] is ever going to break. HTML and the Web standards in general are, with extremely rare exceptions, cumulative. It's pretty frightening how many technical people don't understand this; the Web is intentionally engineered to serve as "the infrastructure for handling humanity's publishing needs indefinitely"[2]. More frightening is that the biggest threat to this are people like the author here who treat the Web as if it's like any other thing that the computing industry puts out—i.e., already perennially broken. This is dangerous because it anachronistically cedes power to folks who'd try to argue at some point in the future that the things about the Web that they'd like to break (and might be in a position to break e.g. due to browser monopoly) are justified and no big deal, really.
The author goes on to call out the Web ("of rubbish") as "user-hostile". Shortly afterward, he or she writes that "PDF makes a stand against the churn". More accurately, PDF makes a stand against the user, by prioritizing authors' creative whims over the reader's needs. This happens again later in their remarks about PDFs being page-oriented: "you are fundamentally not in control of the reading experience." The "you" here is not you, the actual reader. The control they refer to is, once again, the author's.
You get other poor arguments—that PDFs are "offlineable" "files" that can be distributed "decentralized", none of which are accurate criticisms against what HTML lacks—unless those Java documentation zipballs that seemingly every university student enrolled in a CS program in the early 2000s was made to download are a collective hallucination.
And it gets worse from there. Cute stunt to grab attention and all, but the arguments are fundamentally bankrupt.
It's not a browser format (though browsers can render it), but that isn't the same as not being a web format. The web is just the ability to retrieve files from other people's servers, files that may themselves reference other files on yet other people's servers. As long as a file format supports hyperlinks, it's suitable for the web. If you don't care about being able to actually click the hyperlink to activate your desktop system's URI scheme handler, then even plain text works fine.
It would be better if they just used that subset and just published it directly instead of needlessly repackaging it, but if that's what was meant then sure. Maybe we need a better name for simple, semantic HTML and basic CSS.
The point of it is to be a self-contained package. You still need hardware to read it, but not a server. In theory at least, once you have it, it's yours. (of course the commerical ebook vendors are trying to spoil that.)
EPUB is an under-appreciated format that I think can serve as short-to-mid-term storage for human knowledge. It can reasonably re-flow itself when necessary, with no language run-time required, just full Unicode support, at least at the level of the time the file was published.
That's the Internet of knowledge I'd love to see: things organized in EPUB's, searchable and downloadable.
PDF is very far from an ideal format for the today world of different-sized screens. It is a horrible experience on mobile and even worse on eInk pocket books. I would rather advocate making everything available in ePub. Or even better - FB2, it is an easy to grok/implement (designed with manual authoring, simple scripted processing and low-end devices in mind) single-xml structure decoupling the content from the view even more. I often convert ePubs to FB2 (with Pandoc and Calibre) to make PocketBook render them in its native fonts (which always are better) rather than in the font specified in the ePub.
I would also mention that the text within PDFs often is not machine-readable (you copy-paste it and get text without spaces, with additional spaces, or complete garbage), but I believe this is easily avoidable if you bake PDFs the proper way.
I could also suggest publishing everything in Markdown (with images embedded in a Base64 section in the bottom) but this doesn't seem practical because browsers, book-reading apps and eInk devices don't support nice rendering of them directly.
> “But how can I implement shiny whizz-bang features that will engage readers and drive conversions?!” You can’t. PDF is boring
It's not. It supports JavaScript, embedded video and other kinds of active content. Sadly.
One problem I noticed on mobile, is that if I click on a link in the PDF and visit another page, and then try to traverse back, it takes me to the first page in the PDF, rather than the page I linked from.
I honestly can't believe all the praise for HTML and web on HN in the face of this awesome critique. I hugely appreciate the love for actual files.
>• PDFs are decentralised. You may have obtained this PDF from a website, or maybe not! Self-contained static files are liberating! They stand alone and are not dependent on being hosted by any particular web server or under any particular domain. You can publish to one host or to a thousand hosts or to none, and it maintains its identity through content-addressing, not through the blessing of a distributor.
This seems to have gotten lost in the offense everyone has taken over the choice to not use 'simple HTML', despite the document's clear reasoning that to do even that would embed the content deep in the 'urban web'. All of these simple-complex propositions about making some subset language or automating document flows are missing the point entirely.
> You can publish to one host or to a thousand hosts or to none, and it maintains its identity through content-addressing, not through the blessing of a distributor.
It kind of seems like you're describing IPFS, except with worse content-addressing guarantees. The vast majority of your users will never check to see if a PDF's content actually matches its content address.
> All of these simple-complex propositions about making some subset language or automating document flows are missing the point entirely.
Are they? It's really not that hard to build a self-contained HTML file, and to re-emphasize, signed PDFs and signed HTML files are about the same level of accessibility to most users. Web browsers don't really handle either, if you want those guarantees you need to use a protocol/technology with better support right from the start.
Also to be clear, despite the author's argument that PDFs can be self-contained, no browser guarantees that, and there's no way for me to tell if the PDF is self contained when I click on it in Firefox unless I download it and check it myself offline or in a viewer that guarantees it won't make network requests.
Nothing online that I'm aware of forces authors to use PDF/A, so when I download a PDF, I don't know what I'm getting. It's not actually the magical, re-hostable world that the author claims.
I'm not sure that people are missing the author's point so much as they're saying the author is making claims about the portability of PDFs that aren't necessarily accurate. Yes, it would be good to have better self-contained guarantees about some web-content, but I'm not sure PDFs actually supply any of those guarantees.
"But stable standards are incredibly important.They allow software, at least in theory, to be finished. Why is it importantthat software be finished? Because it gives us hope that we might end thechurn and fix all the bugs! I want to use software whose version number is7
1.0. I want to use software whose every line of code has been studied,analysed, optimised and punishingly tested. I want every component andsubcomponent and every interaction and every configuration to beexquisitely documented, and taught in courses, and painstakinglydeconstructed and proven sound"
Sorry, not possible. Never, ever. Software does not work like that. Bugs will never all be fixed (if they could be, the software in question would have become obsolete long ago). By the way, this is what you get when you try to copy-paste text from this "website".
I read it on my phone. I then clicked an external link at the end and then hit my browser back button. I had to wait for the PDF to re-load and was unhappy when I found myself back at the top of the document.
"PDFs are self contained, and can't be broken by an API going down"
Is directly broken by "PDFs are part of the web, and part of the content can be included by reference to a webpage".
If that webpage goes down, that link is broken.
That decentralized bit still needs to conform to broken copyright laws too.
You can't just download a pdf then rehost it on your own without a license to do so
There's also a big difference between a city and the modern web. We own the infrastructure in a city, whereas rich people own it on the web.
Rather than a city, the web is more like a company town. I don't think that's any different for pdfs either. The distribution is still coming from a web server owned by a company -- the real response is self hosting of your stuff, and self hosting by your friends for their stuff. The file format doesn't make it self hosted
PDF-fing everything on your website is one way to go about it...
I personally use the service at printfriendly [1] and Arc90's Readability to make un-crufted and readable PDF files of web content that is worth saving for the coming decades.
Added bonus: by saving these very small files on my system, I can press Command + Spacebar and easily search through my multiple decades of interesting files...
There are good points here, but I think the author slightly undermines his message because the layout and typography of this particular PDF is so poor. Probably because it “was written in the world’s greatest web authoring tool: LibreOffice Writer”.
In other words, one advantage of PDF is that free authoring tools such as the TeX family can create typographically beautiful results that are nearly impossible to achieve with HTML, but he leaves that on the table.
I cannot tell if this is satirical or not. Assuming it is not, every single “pro” of PDFs is just plain incorrect except for the one about being “self-contained” to which I point to https://gwern.net as a good example of self-contained HTML. Gwern archives all the pages he references so that they are always available.
In the case this is satire, I applaud it because I did get a few chuckles.
Useless rant. His choice won't change the rest of the internet and for his site he could easily write lean html without all the stuff he complains about.
* PDFs are files. We must not lose sight of the fact that files are a basic freedom.
This seems like the core belief of the article. And it's at odds with the nature of the web.
In the beginning, the web was a network of devices transmitting files with addressable locations on the device, creating a more or less 1:1 relationship between the devices and the web - the devices WERE the web.
But this inevitably faded as information wants to be... fast and it became easier to whip small data packets around describing state, not files.
I agree with the Unixy belief - files are freedom. But trying to model the entire web on those files is fighting gravity. They're not going anywhere. They just have to travel through the Web Soup sometimes now.
All the technologies enabling a global network of file sharing are still there, the author is just bemoaning today's lingua franca. (json?) And perhaps there is a fear that we will lose sight of "device-based computing" / file ownership.
It has political overtones too... individualism vs collectivism. The web is a very interesting place to hash through those ideas in code before we hash through them in legislation.
The point about the size of the W3C spec is hilarious, but I wonder how much of that hundred million plus words is actually necessary to implement the parts of the spec that people use?
Surely it would be possible to create a spec that captured the most useful subset of HTML and CSS functionality.
In any case if the spec really is that huge the W3C should be written off. Any organization that produces a spec like that is worthless.
Well, sort of. Can't HTML contain script tags with external references (xmlHttpRequest or any async fetch) that a simple crawler/browser may not save to disk?
They could, but if he's the one creating the file, he can choose. And if he's just hosting the file, I'm sure there are tools that will inline all the external resources.
When I click on the submitted link with Chrome on Android, it asks me if I want to redownload "0.pdf". Such a confusing question. If I pick the wrong answer, I end up with some restaurant menu I must have looked at months ago, not what the original poster intended.
So for non-confusing real-world UX I'd recommend extra care with file names if you want to go PDF only.
The whole post boils down to: "HTML is bad because it has scope creep and people use it for bad things, but PDF is good because I made this particular document in a way I like for a use case I prefer."
You do you, man! Some people run Archie servers, some people create a directory full of PDFs.
Thanks. I am starting a self-hosted blog about design fundamentals, best practices, etc. Using only PDF is not a solution for me. Combining minimalistic web-site design with PDF/ePub will suit me well.
I like your approach as a statement against web "pollution".
I don't agree with author's choices (yes, I'm disciplined enough not to add irrelevant elements to my content), but it's really sad that things got to the point where someone actually suggests PDF as an alternative to the web.
We are drowning in churn and noise.
I am fighting by switching this site to PDF
I find the "actual" title unhelpful, unenlightening, uninformative, and uninviting, which I why I originally chose text taken directly from the page, so people would know what it was about before taking the time to click and read.
I know why the HN mods have changed it to "Deurbanising the Web", but I wish they'd keep more informative titles, especially when taken from the article in question.
While I agree with the thesis, I believe it is possible to do things like this with vanilla HTML. For example, I created a search engine that is just a static HTML page: www.locserendipity.com
I'm not old enough to remember Gopher being "the internet" but I have browsed a few retro sites that still run it. I wouldn't mind seeing some slightly upgraded gopher-like protocol that allowed for embedding images and maybe form submissions (without any scripting). Most of what I want to do online is read, and I'd be more than happy for everything to come with a standardized look and feel rather than whatever scroll jacking weirdo design every website feels like having.
Why not just extremely simple, plain HTML? No frameworks, not even CSS. In fact, you could make your life even simpler by using Markdown files and having the browser convert them to HTML in real time with a single JS library (there are a few; I am not promoting any one in particular), so it doesn't even require a "back end"! Plain HTML, while not having all the "portable" attributes of PDF, is still pretty darn robust, and most browsers handle printing (or conversion to PDF) quite well.
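A sketch of that approach, using "marked" purely as one illustrative library choice (the commenter names none), with a hypothetical post.md served from the same directory:

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1">
      <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
    </head>
    <body>
      <main id="content">Loading…</main>
      <script>
        // Fetch the raw Markdown and convert it to HTML in the browser,
        // so the "back end" is nothing but static file hosting.
        fetch('post.md')
          .then(function (r) { return r.text(); })
          .then(function (md) {
            document.getElementById('content').innerHTML = marked.parse(md);
          });
      </script>
    </body>
    </html>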
Some of the listed benefits don't apply. Notably paginated (PDF) vs. scrolled navigation, but also features such as formulae displays and specific typesetting / layout elements, in-page bookmarks, highlighting, and notes.
For shorter documents that's not much of a problem. For anything much over ~chapter length (about 20 pages or 10,000 words), navigation within a single HTML page becomes painful, and it does so well below that level on smaller devices.
This experiment is interesting, but not so bold or novel when you consider the culture around making zines (small, DIY, often quirky magazines). The creativity there is amazing and medium-wise it's often "hybrid" (print-oriented but shared online).
Naming or framing things in a difficult or obtuse way can be a good way to limit your audience. However, if it works others will follow and it will no longer be effective.
I had a similar experience with a Meetup I once hosted which I specifically put in a location that was difficult (but admittedly becoming trendy). It worked for a bit but eventually attracted the crowd I was trying to alienate.
Most people think that PDFs have to be letter or A4 size, but you can make them at A7 or A8 for a phone screen, or for that matter, any size you want.
PDF is size-agnostic. There's nothing to stop you from creating documents the size of a phone screen. So you could put the phone-screen-sized PDF at m.mysite.com, and the small-screen illegibility complaint is solved.
I like the idea of keeping HTML's document-centric original design, but accessing the documents using p2p protocols (instead of the client-server model used on the web).
Why does it seem like almost everyone doesn't realize that PDFs can easily be made to support all the horrors we see in HTML? No, it's fucking well not impossible -- or even notably difficult -- to jam some malicious dynamic code into a PDF. The only reason a period of widespread fear about PDF viruses hasn't developed as it has for websites spreading malicious code is the fact that websites got much more widely adopted. PDFs have been used as malicious code vectors before, and replacing HTML with PDFs would only result in PDFs being the new common vector for the same problem, with at least the same scale and intensity.
This only seems like a solution if you don't know what PDFs can do -- and, by the way, sometimes pagination is bad, especially static (non-reflow) pagination.
EDIT:
Let's make this clearer.
You can actually embed an entire JavaScript application in a PDF. Tell me again how PDFs somehow prevent the problem of dynamic pages on the web. All using PDFs instead of HTML pages would do is wrap the horrors of the web in forms that are generally more hostile to various viewing contexts for the less harmful use cases (e.g. static pages suddenly being harder to read in some contexts with PDFs than with HTML pages).
"Files are a basic human freedom" - that definitely resonates with me.
There's an assortment of trade-offs though. In particular, linking between files breaks if you ever want to move or rename a file. Also, by self-encapsulating every file, you end up using space less efficiently.
I don't consider using PDF for this purpose a good idea. It would be better to have static HTML pages, with a reference to an EPUB with the same content. One can have both generated from the same source with a static site generator.
It also wouldn't be upvoted on HN. I agree that a static page generator would have been a much more fitting technology (for example). But sometimes you gotta sacrifice that for visibility.
The author has a point in that many people want an online presence, but the way they imagine it is more akin to a pamphlet or poster than a hyperlinked website.
If that is the case, then pdf or a resizable image makes sense.
I've always wondered why some sites can serve PDFs that my browser (Firefox) can view inline (my preferred method), rather than forcing me to download the file and open it in a separate application.
This sounds like the Creative Director I worked with, ca. 1998, who bemoaned that he couldn't have pixel-perfect layouts over a wide variety of devices/browsers/operating systems.
Using xelatex, I got only the text, no pushbutton. Using pdflatex, I got a pushbutton, but it was not a hyperlink, just an image. What engine do you use to get this to work?
Comments here are disappointing. The problem with any of this is getting any momentum, so given the level of pushback, pdf might not be it. Having to be a specific version of pdf probably doesn't help. Creating new spec is hopeless as well unless you are someone very famous and can manage to get enough people to adopt. There's text/markdown mediatype which can also serve this purpose but it needs a boost from someone with some street cred. People work in predictable ways and this is a political project.
For my part, I expressed bafflement because the end result seems worse than the starting point in almost every way, including those that the author was complaining about the web for.
This is terrible for accessibility. Please just use semantic HTML and your web will be usable on 10yo devices and unknown devices 10 years in the future.
It's ironic that the author is pitching for PDF, and yet he is using a plethora of hyperlinks.
The big "invention" of the Web was linking pages together. That's what made it great. That's what created "Google" in the first place. Links in a PDF are supposed to take you to a browser or open a different PDF file?
PDF is a step back. If you are angry about the overblown size of JavaScript and resources consumption, use a simple static website. It doesn't get easier than that.
I guess by modern standards this load time is acceptable, but when you argue that PDFs are a way to move forward, you're competing with HTML 4/5. And by that standard:
- Crud this website is so slow. Unacceptably slow. If your technology stack is spending 10 seconds just to fetch and render 13 pages of large-screen text, then either you're doing something wrong or it's a bad technology stack. That load time alone should kill this idea.
- There's no way for me to turn off images. This is the opposite of a client-respecting webpage, the only way you could make it worse is by rendering to Canvas or shipping me a PNG. My mobile browser doesn't fetch fonts by default. You're overriding my choice to do that.
- Mobile? Reflow? Responsive design? Adjustable font sizes? The author kind of offhandedly says that PDFs can do reflow right now, but how many clients actually support that? Does the PDF format handle this by default?
- Saying "you can technically make PDF accessible" is exactly the same as saying "you can technically use just a subset of HTML." It's the same argument. Nobody does it, PDFs are generally hostile to accessibility, and there's no way to signal that a PDF is accessible or enforce it as a community standard.
So, the much bigger question: what's wrong with Gemini[0]? I've been critical of Gemini in the past on multiple fronts, but if you are in this space where you want to burn everything down and make your blog static, Gemini really does seem to solve every problem that the author has, except better. It's also trivial to proxy Gemini documents or statically re-render them to HTML, which makes them accessible to people outside the community. And by default, they're both pretty accessible to screen readers, and much more efficient than what the author is proposing.
The author argues that using static HTML wouldn't be good enough because there's no standard that forces you to exclude Javascript. Then they point to PDF/A, which is not a standard that is enforced by most browser PDF viewers. To me, this argument isn't any different from telling website authors to choose not to use Javascript, what is going to force anyone to use PDF/A? Every web browser PDF reader supports Javascript. NoScript support in Firefox is better than the controls/extensions for disabling PDF scripting.
And Gemini is right there: for the most part it's actually working today. So I just don't get it. Why pick a technology that's tangibly worse than the web on (and I mean this quite literally) almost every single axis and every single metric, when you could instead switch to a markup language that actually does have use-cases, that does simplify deployment and blogging in some instances, that does have a real community, that does have some real advantages over HTML, that does have some real momentum behind it, and that doesn't disrespect my choices about what fonts/images I want to download?
While this may be extreme, I do notice that it is becoming harder and harder to print webpages to PDF/paper. Is there a good approach for this besides the standard print dialog?
For sites without print-specific media queries (so basically all websites) I use dev tools to delete all the DOM nodes I don’t want to appear in print.
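For authors, the fix costs only a few lines. A sketch of a print-specific stylesheet (the class names are hypothetical placeholders for whatever chrome a given site has):

    <style>
      @media print {
        /* Hide navigation chrome and widgets when printing / saving to PDF */
        nav, header, footer, aside, .comments, .share-buttons { display: none; }
        /* Let text use the full page width at a print-friendly size */
        body { max-width: none; font-size: 11pt; }
      }
    </style>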
Definitely less freedom. In HTML, the reader can change the size of the text or even the font, and the text will reflow so you don't need to scroll horizontally to read each line. How do you do that with a PDF?
Long sympathetic with the Jakob Nielsen / PDF-bad camp ... I've had some recent changes of heart. Not a full convert, but PDF is often superior to HTML, especially for longer-form and complex noninteractive content.
Books are an artefact whose design has evolved over the centuries to accommodate human-scale ergonomics: font size, paper and ink colour, words per line, lines per page, pages per volume, overall weight and dimensions. Standard-sized books are all larger than the largest current mobile phones, with diagonal measures of about 9--12 inches. There are smaller and larger books, but those are compromises either to portability (pocketbooks) or to large-format resolution and detail ("coffee table" books, atlases, and the like). Magazines tend to run even larger (about 13"), broadsheet newspapers larger yet. Most criticisms of PDFs are actually criticisms of the devices and displays used to read them.
Poor resolution, incorrect aspect ratios, and small display sizes (especially mobile devices) are the key problems.
Reading PDFs on a tablet, especially a larger e-ink device, is a game-changer. I now actively avoid HTML, or at least launch it in a browser designed with e-ink in mind (EInkBro: https://github.com/plateaukao/browser). Otherwise, my large (13.3") high-DPI (200+) B&W ebook reader is an excellent long-form immersive reading tool.
The key requirement of a mobile phone is that it fit in a pocket, handbag, or purse. They are too small for reading, and aren't designed for that purpose. Current devices feature screen sizes of roughly 5--7 inches (diagonal measurement). At the lower end, that's smaller than a 3x5 index card (6"), and the largest barely the size of a 4x6 card (7").
On desktops, the first display that offered what I felt was a truly comfortable two-pages-up PDF reading experience was the 27" Retina iMac. Its 5K display (itself an oddball size) suits document work well. Even not fully maximised, most books are highly readable (leaving screen space for other tasks), and at full maximisation, details really stand out, especially from scans of historical editions. (Such details aren't always relevant, but often are.)
PDF also provides capabilities HTML either cannot or does not by default (and few seem to be persuaded to offer), especially pagination, formulae, and a spatially-persistent layout (if you have a spatial memory, this is very valuable).
PDFs can, though often do not, include internal navigation (chapters, sections, etc.), search (if full text is included), and most critically, metadata (at a minimum: author, title, date, and publisher; see the full Dublin Core metadata specification for what should be required).
PDFs can also be published directly to device sizes (or to a set of form factors encompassing typical devices), as several others note.
Some of the issues aren't entirely intrinsic, and my feeling is that wider use of PDFs for online content would lead to a proliferation of PDF annoyances to match present-day Web annoyances. In each case, the fundamental problem is that publishers rather than readers have final say over presentation. An alternative, of distributing raw minimum markup and formatting that to user specifications following a set number of templates ... might help.
It's ironic that the article here embodies a number of PDF annoyances:
- The shaded background renders quite poorly on a B&W e-ink reader (though can be eliminated with a watermark-removal setting).
- The filename provides no clues as to contents or provenance, and is likely to collide with other content.
- I'm a fan of serif fonts, not sans serif, for high-DPI reading.
- Internal and external hyperlink support is ... variable. At times utterly missing, at others, inconsistent or inconvenient.
- PDFs are not trivially directly editable, which means neither authors nor readers can correct errors or address issues.
- Many PDFs lack internal structure, even where the documents they encompass have it. The number of books lacking PDF table-of-contents support is ... large.
- Metadata standards and practices are abysmal. See the Dublin Core standards.
- Naming conventions similarly. "Report", "Resume", "Project", or "0.pdf" are names which should never be used. Describe author, content, and date, as a minimum, if possible.
The sad thing is, this is what the web was SUPPOSED to be, more or less: a series of static documents, text and images. The only interactivity (setting aside the occasional CGI forms) was that you could click certain images or text and go to other static documents. Documents linked to documents.
Then everyone lost their minds and decided webpages needed to be PROGRAMS and we've been paying the price ever since.
While I appreciate the sentiment, I don't think PDF is the way, at least in the way you're currently doing it. PDF may be supported by browsers, but they're not intended for it; it's a secondary feature. Same for search engines. Same for mobile.
Most browsers have Print to PDF. If you want people to be able to download an immutable version of your content, then just have a simple static version of your page with a valid print css, better yet, leave everything default.
There are also other lightweight alternatives. The Gopher protocol has a small, but disturbed following : http://gopher.muffinlabs.com/gopher.floodgap.com (you can actually use netcat as your gopher client). Gemini is a more modern gopher-inspired protocol https://gemini.circumlunar.space/. Personally, I'd be pleased to see a text-first approach gain adoption. I don't think anyone looks at the thick-client model browsers have evolved into and sees an optimal solution.
I think evangelistic energy should probably be directed at complaining to organizations that share content through JS-framework monstrosities. Getting rank-and-file web-devs excited about lean websites doesn't hurt, but clients and CTOs have real decision making power.
THANK YOU! HTML semantics are a trap: just enough to make you think something is there, but anemic enough to be a giant exercise in bikeshedding. Ask yourself this: if HTML semantics were adequate, why do we have ARIA and 90 different microformats?
Other than that, I read the article expecting to be annoyed by the PDF presentation but was pleasantly surprised by how it read just like I would want a content page to read. My only complaint is that browsers (at least Brave) do not preserve scroll position in PDFs. If the browsers fix that the author may be onto something here.
Sounds to me like ePub would fit better. It's designed for reflow and it's built out of a subset of HTML. Worst case, the contents of the file can be expanded.