> This remains a long-standing pet peeve of mine. PDFs like this are horrible to read on mobile phones, hard to copy-and-paste from ...
I've never understood why copying text from digitally native PDFs (created directly from digital source files, rather than by OCR-ing scanned images) is so often such a poor experience. Even PDFs produced from LaTeX often contain undesirable ligatures in the copied text, like fi and fl. Text copied from some Springer journals sometimes lacks space between words or introduces unwanted space between letters in a word ... Is it due to something inherent in PDF technology?
> Is it due to something inherent in PDF technology?
Exactly. PDF doesn't have instructions to say "render this paragraph of text in this box", it has instructions to say "render each of these glyphs at each of these x,y coordinates".
It was never designed to have text extracted from it. So trying to turn it back into text involves a lot of heuristics and guesswork, like where enough separation between characters should be considered a space.
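To make that concrete, here's a toy sketch (Python, not any real extractor's code) of the kind of guesswork involved: all the extractor sees is glyphs and x-coordinates, and it has to pick a threshold for "this gap is probably a word break".

    # Toy sketch, not any real extractor's code: infer word breaks from glyph
    # positions, since the PDF content stream never says where the spaces were.
    def glyphs_to_text(glyphs, space_factor=0.3):
        """glyphs: list of (char, x_start, x_end, font_size) on one text line."""
        out, prev_end = [], None
        for char, x0, x1, size in glyphs:
            if prev_end is not None and (x0 - prev_end) > space_factor * size:
                out.append(" ")  # gap looks wide enough -> guess a word break
            out.append(char)
            prev_end = x1
        return "".join(out)

    # "Hello world" as bare glyphs, with a visual gap in the middle:
    line = [("H", 0, 7, 10), ("e", 7, 13, 10), ("l", 13, 16, 10), ("l", 16, 19, 10),
            ("o", 19, 25, 10), ("w", 29, 37, 10), ("o", 37, 43, 10), ("r", 43, 47, 10),
            ("l", 47, 50, 10), ("d", 50, 56, 10)]
    print(glyphs_to_text(line))  # "Hello world" -- but only if the threshold guess is right

Tighten the threshold and words get glued together; loosen it and you get spaces inside words, which is exactly the failure mode described above for some journals.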
A lot also depends on what software produced the PDF, which can make it easier or harder to extract the text.
I've never looked into the PDF format, but does it not allow for annotations that say, "the glyphs in the rectangle ((x0, y0), (x1, y1)) represent the text 'foobar'"? That's been my mental model for how they are text-searchable.
PDF natively supports selectable/extractable text. Section 9.10 of ISO 32000 is literally “Extraction of Text Content.” I’ve implemented it myself in production software.
There are many good reasons why PDF has a “render glyph” instruction instead of a “render string”. In particular, your printer and your PDF viewer should not need to have the same text shaping and layout algorithms in order for the PDF to render the same. Oops, your printer runs a different version of HarfBuzz!
The sibling comment is right that a lot depends on the software that produced the PDF. It’s important to be accurate about where the blame lies. I don’t blame the x86 ISA or the C++ standards committee when an Electron app uses too much memory.
It’s due to poor choices made in the implementation of pdfTeX. For example the TeX engine does not associate the original space characters with the inter-word “glue” that replaces them, so pdfTeX happily omits them. This was fixed a few years back, finally. But there’s millions(?) of papers out there with no spaces.
Ligatures like fi, fl, ffi, ffl, etc. are font-level substitutions made so the text renders correctly on a screen or printer. PDF is intended to be a _rendered_ format, rather than a parseable format.
Well-formatted EPUB and HTML, by contrast, are intended to adapt to end-user needs and better fit the available layout space.
Though it’s also a stuck legacy throwback. Modern advice would be not to send ligatures directly to the renderer and instead let the renderer poll OpenType features (and Unicode/ICU algorithms) to build them itself. PDF’s baking of some ligatures into its files seems like something of a backwards-compatibility legacy mistake, kept to still support ancient "dumb" PostScript fonts and pre-Unicode font encodings (or at least pre-Unicode Normalization Forms). It’s also a bit of the fact that PDF has always been confused about whether it is the final renderer in a stack or not.
That wouldn’t work for PDF’s use case of being an arbitrary paper-like format, because the various Unicode and OpenType algorithms don’t provide sufficient functionality for rendering arbitrary text: there are no one-size-fits-all rules! The standards are a set of generic “best effort” guidelines for lowest-common-denominator text layout that are constantly being extended.
Even for English the exact tweaking of line breaking and hyphenation is a problem that requires manual intervention from time to time. In mathematics research papers it’s not uncommon to see symbols that haven’t yet made it into Unicode. Look at the state of text on the web and you’ll encounter all these problems; even Google Docs gave in and now renders to a canvas.
PDF’s Unicode handling is indeed a big mess, but it does have the ability to associate any glyph with an arbitrary Unicode string for text extraction purposes, so there’s nothing to stop the program that generates the PDF from mapping the fi ligature glyph to the two-character string “fi”.
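And even if the producer didn’t do that and U+FB01/U+FB02 end up in your clipboard anyway, compatibility normalization will usually recover the plain characters after the fact. A quick illustration (Python standard library only):

    import unicodedata

    # If extraction hands back the Unicode presentation-form ligatures instead of
    # plain letters, NFKC normalization decomposes them back to "fi", "fl", "ffi", ...
    extracted = "e\ufb03cient work\ufb02ow \ufb01le"   # "e<ffi>cient work<fl>ow <fi>le"
    print(unicodedata.normalize("NFKC", extracted))    # -> "efficient workflow file"

(That only helps with ligatures, of course; it can’t reinsert missing spaces.)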
I think you are seeing different problems here than the ones I was complaining about. Maybe I can restate the case: baseline PDF 1.0 is something like (but not exactly) rendering for print to a specific PostScript printer that understands embedded PostScript fonts, somewhat like a virtual version of an early Apple LaserWriter. PDF has been extended and upgraded over the years, and the target “printer” that PDF represents has upgraded too: it now also understands Unicode in its PostScript and embedded OpenType fonts (with their many extensions for character mapping, ligatures, contextual alternates, etc.). But because its legacy was a dumber printer, a lot of apps that render to PDF still take the easiest “backwards compatibility” route and do dumb things like encoding ligatures “by hand” in quaint encodings, such as some of the extended ASCII code pages or the pre-combined Unicode forms we consider obsolete today, as if they were printing to a PostScript printer that can only handle older PostScript fonts and doesn’t understand ligatures directly.
Yes, if you don’t embed your fonts (or at least their metrics) in the PDF, layout is less deterministic and will shift from font to font. The point is that we can embed modern fonts in PDF, the virtual printer has upgraded support for that, but for all sorts of reasons some of the tools that build PDF are still acting like they are “printing” to the lowest common denominator and using legacy EBCDIC ligature encodings in 2024. (Fun fact: Unicode’s embeddings of the ligatures are somewhat closer to some classic EBCDIC report printing code pages than Extended ASCII’s equivalents because there was a larger backwards compatibility gulf there.)
Agreed - I used CSS to lay out a book a couple of years ago and it wasn't too bad, but the things that have poor support/don't work at all (like page numbers) are a pain to hack around.
If a PDF doesn't support text extraction, it's the fault of the software that created it. Most likely the software didn't include the glyph → Unicode character mapping in the PDF.
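Conceptually that mapping is just a lookup table from font-internal glyph IDs to Unicode strings. A minimal sketch with made-up glyph IDs (not a real ToUnicode CMap parser):

    # Minimal sketch with made-up glyph IDs, not a real CMap parser: once the
    # producer embeds a glyph -> Unicode table, extraction is a simple lookup.
    to_unicode = {3: "fi", 5: "n", 6: "d"}   # glyph 3 is the fi ligature, mapped to TWO characters

    def extract(glyph_ids, cmap):
        return "".join(cmap.get(g, "\ufffd") for g in glyph_ids)

    print(extract([3, 5, 6], to_unicode))  # -> "find"
    print(extract([3, 5, 6], {}))          # no mapping embedded -> replacement characters

Without that table, the extractor is back to guessing from glyph names or shapes, which is where the garbage comes from.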