
PDF files, and why the heck they are so slow to read. Hours upon hours of perf(1) and fiddling with ugly things in C. My main takeaway is everyone in the world is doing things HORRIBLY wrong and there's no way to stop them.

(Digression: did you know libpng, the one everyone uses, is not supposed to be an optimized production library—rather, it's a reference implementation? It's almost completely unoptimized; no really, take a look anywhere in the codebase. Critical hot loops are 15-year-old C that doesn't autovectorize. I easily got a 200% speedup with a 30-line patch on something I cared about (their decoding of 1-bit bilevel to RGBA). I'm using that modified libpng right now. I know of nowhere to submit this patch. Why the heck is everyone using libpng?)
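(And to give a flavor of the kind of hot loop I mean: the sketch below is not libpng's code and not my patch, just an illustration with a made-up function name, assuming the usual 1-bit grayscale convention of 0 = black, 1 = white. A flat per-pixel expansion like this is trivial to vectorize, either by the compiler or by hand, eight output pixels per input byte.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative sketch only: not libpng's code, not my patch.
       Expand one row of 1-bit bilevel pixels (MSB-first) to 8-bit RGBA. */
    static void expand_1bit_to_rgba(const uint8_t *row, uint8_t *out, size_t width)
    {
        for (size_t x = 0; x < width; x++) {
            uint8_t bit = (row[x >> 3] >> (7 - (x & 7))) & 1u;  /* MSB-first */
            uint8_t v   = bit ? 0xFF : 0x00;   /* bilevel: white or black */
            out[4 * x + 0] = v;                /* R */
            out[4 * x + 1] = v;                /* G */
            out[4 * x + 2] = v;                /* B */
            out[4 * x + 3] = 0xFF;             /* A: opaque */
        }
    }

The real libpng path also has to cope with palettes, transparency and other bit depths, so this is only the shape of the thing.)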

The worst offender (so far) is the JBIG2 format (several major libraries, including jbig2dec), a very popular format that gets EXTREMELY high compression ratios on bilevel images of types typical to scanned pdfs. But: it's also a format that's pretty slow to decompress—not something you want in a UI loop, like a PDF reader is! And, there's no way around that—if you look at the hot loop, which is arithmetic coding, it's a mess of highly branchy code that's purely serial and cannot be thread- nor SIMD- parallelized. (Standardized in 2000, so it wasn't an obvious downside then.) I want to try to deep-dive into this one (as best as my limited skill allows), but I think it's unlikely there's any low-hanging optimization fruit, like there's so much of in libpng. It's all wrong that everyone's using this slow, non-optimizable compression format in PDFs today, but no one really cares. Everyone's doing things wrong and there is no way to stop them.
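(For anyone wondering what "purely serial" looks like concretely: below is a stripped-down sketch of a context-modelled binary arithmetic decoder. It is not the actual MQ coder JBIG2 specifies; the state handling here is in the style of a generic range coder and the names are mine. But the shape is the same, and the shape is the problem: each decoded bit updates the interval registers and the context probability that the very next bit needs, and the branch depends on the decoded data, so there's nothing for SIMD lanes or extra threads to grab onto.

    #include <stdint.h>

    /* Sketch of a binary, context-modelled arithmetic decoder (generic
       range-coder style, standing in for JBIG2's MQ coder). */
    typedef struct {
        uint32_t range;      /* current interval width */
        uint32_t code;       /* compressed bits currently in flight */
        const uint8_t *in;   /* compressed input stream */
    } bin_arith_dec;

    static int decode_bit(bin_arith_dec *d, uint16_t *prob /* context state */)
    {
        uint32_t bound = (d->range >> 12) * (*prob);   /* split the interval */
        int bit;
        if (d->code < bound) {                         /* data-dependent branch */
            d->range = bound;
            *prob += (0x1000 - *prob) >> 5;            /* adapt the context model */
            bit = 0;
        } else {
            d->code  -= bound;
            d->range -= bound;
            *prob -= *prob >> 5;
            bit = 1;
        }
        while (d->range < (1u << 24)) {                /* renormalize: serial */
            d->range <<= 8;
            d->code = (d->code << 8) | *d->in++;
        }
        return bit;   /* the next bit needs this bit's range, code and *prob */
    }

In JBIG2's generic region decoding you pay roughly one call like that per pixel, with the context picked from neighboring pixels, which is why it's slow and why it stays slow.)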

Another observation: lots of people create PDFs at print-quality pixel density that's useless for screens and greatly increases rendering latency. Does JBIG2 support interlacing or progressive decoding, to sidestep this challenge? Of course it doesn't.

Everyone's doing PDF things wrong and there is no way under the blue sky to make them stop.



> The worst offender (so far) is the JBIG2 format (several major libraries, including jbig2dec), a very popular format that gets EXTREMELY high compression ratios on bilevel images of types typical to scanned pdfs. But: it's also a format that's pretty slow to decompress—not something you want in a UI loop, like a PDF reader is! And, there's no way around that—if you look at the hot loop, which is arithmetic coding, it's a mess of highly branchy code that's purely serial and cannot be thread- nor SIMD- parallelized.

Looking at the jbig2dec code, there appears to be some room for improvement. If my observations are correct, each segment has its own arithmetic decoder state, and thus can be decoded in its own thread. The main reader loop[1] is basically a state machine which attempts to load each segment in sequence[2], but it should not need to: the file's segment headers contain each segment's offset and size. It should be possible to parse all the segment headers first, then spawn N threads to decode N segments in parallel (rough sketch below the links). Obviously you don't want the threads competing for the file resource, so you could load each segment into its own buffer first, or mmap the whole file into memory.

[1]:https://github.com/ArtifexSoftware/jbig2dec/blob/master/jbig...

[2]:https://github.com/ArtifexSoftware/jbig2dec/blob/master/jbig...
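Roughly what I mean, as a sketch (decode_one_segment and the segment_job struct are stand-ins, not jbig2dec's actual API, and this glosses over the fact that some segment types refer to other segments, e.g. a text region using a symbol dictionary, which would constrain the ordering):

    #include <pthread.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* Stand-in types and decoder entry point, not jbig2dec's real API. */
    typedef struct {
        const uint8_t *data;   /* this segment's bytes, e.g. a slice of the mmapped file */
        size_t len;
        void *result;          /* decoded region, filled in by the worker */
    } segment_job;

    void *decode_one_segment(const uint8_t *data, size_t len);   /* hypothetical */

    static void *worker(void *arg)
    {
        segment_job *job = arg;
        /* each segment carries its own arithmetic-decoder state, so no sharing */
        job->result = decode_one_segment(job->data, job->len);
        return NULL;
    }

    /* jobs[] comes from a first pass that parses only the segment headers
       (offsets and data lengths), without touching the segment payloads. */
    static int decode_segments_parallel(segment_job *jobs, size_t n)
    {
        pthread_t *tids = malloc(n * sizeof *tids);
        if (!tids)
            return -1;
        for (size_t i = 0; i < n; i++)
            pthread_create(&tids[i], NULL, worker, &jobs[i]);
        for (size_t i = 0; i < n; i++)
            pthread_join(tids[i], NULL);
        free(tids);
        return 0;
    }

Pointing each job at a slice of one mmap of the file (or at per-segment buffers) keeps the threads from contending on the file handle.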


- "If my observations are correct, each segment has its own arithmetic decoder state, and thus can be decoded in its own thread."

Yeah, but real-world PDF JBIG2s usually seem to have only one segment! One of the first things I checked—they wouldn't have made it that easy, the world's too cruel.

It's sort of a generic problem with compression formats—lots of files could easily be encoded as multiple segments that decompress in parallel, but aren't. If people don't encode them in multiple segments, you can't decompress them in multiple segments. Most formats support something like that in the spec, but most tools either don't implement it or don't have it on by default.

e.g. https://news.ycombinator.com/item?id=33238283 ("pigz: A parallel implementation of gzip for multi-core machines"—fully compatible with the gzip format and with gzip(1)! No one uses it).


Yikes! Doesn't seem like there's anything that can be done to solve that then.

I guess the only way to tackle it would be to target the popular software or libraries for producing PDFs to begin with and try to upstream multi-segment (parallel-decodable) encoding into them.

Or is it possible to "convert" existing PDFs from single-segment to multi-segment PDFs, to make for faster reading on existing software?


Conversion's a very good solution for files you're storing locally! I'm working on polishing a script workflow to implement this—I haven't figured out which format to store things in yet. I don't consider it a full solution to the problem—more of a bandaid/workaround.

The downside is that any PDF conversion is a long-running batch job, one that probably shouldn't be part of any UX sequence—it's way too slow.

Emacs' PDF reader does something like this: when it loads a PDF, its default behavior is to start a background script that converts every page into a PNG, which decodes much more quickly than the image formats typically embedded in PDFs. (You can start reading the PDF right away, and by the end of the conversion it becomes more responsive.) I think it's a questionable design choice: it's a high-CPU task during a UI interaction, and potentially a long-running one for a large PDF. (This is why I was profiling libpng, incidentally.)

https://www.gnu.org/software/emacs/manual/html_node/emacs/Do...


> a very popular format that gets EXTREMELY high compression ratios on bilevel images of types typical to scanned pdfs

Funny you say that, https://en.wikipedia.org/wiki/JBIG2#Character_substitution_e...


I can still remember this cool talk about that. https://www.youtube.com/watch?v=7FeqF1-Z1g0


Question, since you're probably knowledgeable about this right now.

> Another observation: lots of people create PDFs at print-quality pixel density that's useless for screens and greatly increases rendering latency.

Is this relevant to text in the PDF? I would assume text is vectorized, meaning resolution is not relevant until you _actually_ print it?

Or is it just relevant to rasterized content like embedded images?


Your understanding's right: PDFs that are text + fonts are easy and fast. I'm concerned about the other kind: scanned pages. Any sheet music from Petrucci / imslp.org, for one example. That kind is a sequence of raster images, stored in compressed-image formats most people aren't familiar with, because they're specialized to bi-level (1-bit, black and white) images. A separate class from photo-type images. The big two seem to be JBIG2 [0] and CCITT Group 4 [1], which was standardized for fax machines in the 1980s (and still works well!)

[0] https://en.wikipedia.org/wiki/JBIG2

[1] https://en.wikipedia.org/wiki/Fax#Modified_Modified_READ

(You can examine this stuff with pdfimages(1)—or just rg -a for strings like /JBIG2Decode or /CCITTFaxDecode and poke around).


Personally I have the impression that CCITT group 4 compressed PDFs are displayed very quickly, unless they are scanned at 3000 DPI... Can't say the same for JBIG2 or JPEG/JPEG2000 based ones.


> A separate class from photo-type images.

I'd assume that the photo-type image decoder is optimized, right? If so, how does the optimized photo-type decoder compare to the apparently unoptimizable JBIG2 decoder?


I'm not knowledgeable enough to speak to that, but just to clarify—the low-hanging fruit in libpng I mentioned is in simple, vectorizable loops—conversions between pixel formats in buffers. Not in its compression algorithm (which isn't part of libpng—it calls out to zlib for that).


> I know of nowhere to submit this patch.

How about the folks listed as "Authors":

* http://www.libpng.org/pub/png/libpng.html

> Why the heck is everyone using libpng?

What are the alternatives?


sumatrapdf seems better than most at reading them?


Is there any ffmpeg-like command-line program for PDFs? Creating, appending/removing pages, viewing, etc.?


For appending, removing, merging pages, there’s pdftk: https://www.pdflabs.com/tools/pdftk-server/


gs (ghostscript), mutool, ocrmypdf...

To add/remove: mutool merge -h

To split PDF pages: mutool poster -h

I made a script here that I use frequently for scanned documents: https://github.com/chapmanjacobd/computer/blob/main/bin/pdf_...



