Though it's also a bit of a stuck legacy throwback. Modern advice would be not to send ligatures directly to the renderer but to let the renderer query OpenType features (and Unicode/ICU algorithms) to build them itself. PDF's baking of some ligatures into its files seems like a backwards-compatibility legacy decision to keep supporting ancient "dumb" PostScript fonts and pre-Unicode font encodings (or at least pre-Unicode Normalization Forms). It's also partly the fact that PDF has always been confused about whether it is the final renderer in a stack or not.
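For a sense of what "let the renderer build them itself" looks like in practice, here's a minimal sketch using the uharfbuzz HarfBuzz binding (the font path and sample text are just placeholders): the shaper consults the font's own "liga" feature and substitutes the ligature glyph at render time, so the document text can stay as plain characters.

    import uharfbuzz as hb

    # Load an OpenType font (path is a placeholder).
    blob = hb.Blob(open("SomeFont.otf", "rb").read())
    face = hb.Face(blob)
    font = hb.Font(face)

    # Shape plain text; the shaper applies the font's "liga" feature and
    # emits the fi-ligature glyph if the font defines one.
    buf = hb.Buffer()
    buf.add_str("office")
    buf.guess_segment_properties()
    hb.shape(font, buf, {"liga": True})

    for info in buf.glyph_infos:
        print(info.codepoint, info.cluster)  # glyph id, index of source character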
That wouldn’t work for PDF’s use case of being an arbitrary paper-like format, because the various Unicode and OpenType algorithms don’t provide sufficient functionality for rendering arbitrary text: there are no one-size-fits-all rules! The standards are a set of generic “best effort” guidelines for lowest-common-denominator text layout that are constantly being extended.
Even for English the exact tweaking of line breaking and hyphenation is a problem that requires manual intervention from time to time. In mathematics research papers it’s not uncommon to see symbols that haven’t yet made it into Unicode. Look at the state of text on the web and you’ll encounter all these problems; even Google Docs gave in and now renders to a canvas.
PDF’s Unicode handling is indeed a big mess, but it does have the ability (via a ToUnicode CMap) to associate any glyph with an arbitrary Unicode string for text extraction purposes, so there’s nothing to stop the program that generates the PDF from mapping the fi ligature glyph to the two-character string “fi”.
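For illustration, a small sketch of what that mapping amounts to (the helper name and glyph code are made up): a ToUnicode CMap "bfchar" entry pairs a glyph code with an arbitrary UTF-16BE string, so a ligature glyph can extract and copy/paste as the two characters "fi".

    # Hypothetical helper: build one ToUnicode CMap "bfchar" entry that maps
    # a glyph code to an arbitrary Unicode string (encoded as UTF-16BE hex).
    def bfchar_entry(glyph_code: int, text: str) -> str:
        utf16be = text.encode("utf-16-be").hex().upper()
        return f"<{glyph_code:04X}> <{utf16be}>"

    # Map whatever glyph code the font uses for the fi ligature (0x00AD is
    # just an example) back to the two-character string "fi":
    print(bfchar_entry(0x00AD, "fi"))  # -> <00AD> <00660069>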
I think you are seeing different problems here than I was complaining about. Maybe I can restate the case: baseline PDF 1.0 is something like (but not exactly) rendering to print on a specific PostScript printer that understands embedded PostScript fonts, somewhat like a virtual version of an early Apple LaserWriter. PDF has been extended and upgraded over the years, and by now the target "printer" that PDF represents has been upgraded too: it understands Unicode in its PostScript and also understands embedded OpenType fonts (with their many extensions for character mapping, ligatures, contextual alternates, etc.). But because its legacy was a dumber printer, for the easiest "backwards compatibility" a lot of apps that render to PDF still do dumb things like encoding ligatures "by hand" in quaint encodings, such as some of the extended ASCII code pages or the pre-combined Unicode forms we consider obsolete today, as if they were printing to a PostScript printer that doesn't understand ligatures directly because it can only print older PostScript fonts.
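As a concrete example of those pre-combined forms: Unicode still carries compatibility codepoints like U+FB01 LATIN SMALL LIGATURE FI, and the normalization tables themselves treat them as decomposable legacy. A quick check in Python:

    import unicodedata

    s = "of\ufb01ce"  # "office" written with the precomposed U+FB01 fi ligature
    print(unicodedata.name("\ufb01"))        # LATIN SMALL LIGATURE FI
    print(unicodedata.normalize("NFKC", s))  # "office" -- the ligature
                                             # decomposes back to "f" + "i"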
Yes, if you don't embed your fonts (or at least their metrics) in the PDF, layout is less deterministic and will shift from font to font. The point is that we can embed modern fonts in PDF, the virtual printer has been upgraded to support that, but for all sorts of reasons some of the tools that build PDFs are still acting like they are "printing" to the lowest common denominator and using legacy EBCDIC ligature encodings in 2024. (Fun fact: Unicode's encodings of the ligatures are somewhat closer to some classic EBCDIC report-printing code pages than to Extended ASCII's equivalents, because there was a larger backwards-compatibility gulf there.)