
PDF files, and why the heck they are so slow to read. Hours upon hours of perf(1) and fiddling with ugly things in C. My main takeaway is everyone in the world is doing things HORRIBLY wrong and there's no way to stop them.

(Digression: did you know libpng, the one everyone uses, is not supposed to be an optimized production library—rather, it's a reference implementation? It's almost completely unoptimized; no really, take a look anywhere in the codebase. Critical hot loops are 15-year-old C that doesn't autovectorize. I easily got a 200% speedup with a 30-line patch on something I cared about (their decoding of 1-bit bilevel to RGBA). I'm using that modified libpng right now. I know of nowhere to submit this patch. Why the heck is everyone using libpng?)
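(And to give a flavor of the kind of hot loop I mean: the sketch below is not libpng's code and not my patch, just an illustration with a made-up function name, assuming the usual 1-bit grayscale convention of 0 = black, 1 = white. A flat per-pixel expansion like this is trivial to vectorize, either by the compiler or by hand, eight output pixels per input byte.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative sketch only: not libpng's code, not my patch.
       Expand one row of 1-bit bilevel pixels (MSB-first) to 8-bit RGBA. */
    static void expand_1bit_to_rgba(const uint8_t *row, uint8_t *out, size_t width)
    {
        for (size_t x = 0; x < width; x++) {
            uint8_t bit = (row[x >> 3] >> (7 - (x & 7))) & 1u;  /* MSB-first */
            uint8_t v   = bit ? 0xFF : 0x00;   /* bilevel: white or black */
            out[4 * x + 0] = v;                /* R */
            out[4 * x + 1] = v;                /* G */
            out[4 * x + 2] = v;                /* B */
            out[4 * x + 3] = 0xFF;             /* A: opaque */
        }
    }

The real libpng path also has to cope with palettes, transparency and other bit depths, so this is only the shape of the thing.)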

The worst offender (so far) is the JBIG2 format (several major libraries, including jbig2dec), a very popular format that gets EXTREMELY high compression ratios on bilevel images of types typical to scanned pdfs. But: it's also a format that's pretty slow to decompress—not something you want in a UI loop, like a PDF reader is! And, there's no way around that—if you look at the hot loop, which is arithmetic coding, it's a mess of highly branchy code that's purely serial and cannot be thread- nor SIMD- parallelized. (Standardized in 2000, so it wasn't an obvious downside then.) I want to try to deep-dive into this one (as best as my limited skill allows), but I think it's unlikely there's any low-hanging optimization fruit, like there's so much of in libpng. It's all wrong that everyone's using this slow, non-optimizable compression format in PDFs today, but no one really cares. Everyone's doing things wrong and there is no way to stop them.
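(For anyone wondering what "purely serial" looks like concretely: below is a stripped-down sketch of a context-modelled binary arithmetic decoder. It is not the actual MQ coder JBIG2 specifies; the state handling here is in the style of a generic range coder and the names are mine. But the shape is the same, and the shape is the problem: each decoded bit updates the interval registers and the context probability that the very next bit needs, and the branch depends on the decoded data, so there's nothing for SIMD lanes or extra threads to grab onto.

    #include <stdint.h>

    /* Sketch of a binary, context-modelled arithmetic decoder (generic
       range-coder style, standing in for JBIG2's MQ coder). */
    typedef struct {
        uint32_t range;      /* current interval width */
        uint32_t code;       /* compressed bits currently in flight */
        const uint8_t *in;   /* compressed input stream */
    } bin_arith_dec;

    static int decode_bit(bin_arith_dec *d, uint16_t *prob /* context state */)
    {
        uint32_t bound = (d->range >> 12) * (*prob);   /* split the interval */
        int bit;
        if (d->code < bound) {                         /* data-dependent branch */
            d->range = bound;
            *prob += (0x1000 - *prob) >> 5;            /* adapt the context model */
            bit = 0;
        } else {
            d->code  -= bound;
            d->range -= bound;
            *prob -= *prob >> 5;
            bit = 1;
        }
        while (d->range < (1u << 24)) {                /* renormalize: serial */
            d->range <<= 8;
            d->code = (d->code << 8) | *d->in++;
        }
        return bit;   /* the next bit needs this bit's range, code and *prob */
    }

In JBIG2's generic region decoding you pay roughly one call like that per pixel, with the context picked from neighboring pixels, which is why it's slow and why it stays slow.)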

Another observation: lots of people create PDFs at print-quality pixel density that's useless for screens and greatly increases rendering latency. Does JBIG2 support interlacing or progressive decoding, to sidestep this challenge? Of course it doesn't.

Everyone's doing PDF things wrong and there is no way under the blue sky to make them stop.



> The worst offender (so far) is the JBIG2 format (several major libraries, including jbig2dec), a very popular format that gets EXTREMELY high compression ratios on bilevel images of types typical to scanned pdfs. But: it's also a format that's pretty slow to decompress—not something you want in a UI loop, like a PDF reader is! And, there's no way around that—if you look at the hot loop, which is arithmetic coding, it's a mess of highly branchy code that's purely serial and cannot be thread- nor SIMD- parallelized.

Looking at the jbig2dec code, there appears to be some room for improvement. If my observations are correct, each segment has its own arithmetic decoder state, and thus can be decoded in its own thread. The main reader loop[1] is basically a state machine which attempts to load each segment in sequence[2], but it should not need to: the file's segment headers contain each segment's offset and size. It should be possible to parse all the segment headers first, then spawn N threads to decode N segments in parallel (rough sketch below the links). Obviously you don't want the threads competing for the file resource, so you could load each segment into its own buffer first, or mmap the whole file into memory.

[1]:https://github.com/ArtifexSoftware/jbig2dec/blob/master/jbig...

[2]:https://github.com/ArtifexSoftware/jbig2dec/blob/master/jbig...
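Roughly what I mean, as a sketch (decode_one_segment and the segment_job struct are stand-ins, not jbig2dec's actual API, and this glosses over the fact that some segment types refer to other segments, e.g. a text region using a symbol dictionary, which would constrain the ordering):

    #include <pthread.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* Stand-in types and decoder entry point, not jbig2dec's real API. */
    typedef struct {
        const uint8_t *data;   /* this segment's bytes, e.g. a slice of the mmapped file */
        size_t len;
        void *result;          /* decoded region, filled in by the worker */
    } segment_job;

    void *decode_one_segment(const uint8_t *data, size_t len);   /* hypothetical */

    static void *worker(void *arg)
    {
        segment_job *job = arg;
        /* each segment carries its own arithmetic-decoder state, so no sharing */
        job->result = decode_one_segment(job->data, job->len);
        return NULL;
    }

    /* jobs[] comes from a first pass that parses only the segment headers
       (offsets and data lengths), without touching the segment payloads. */
    static int decode_segments_parallel(segment_job *jobs, size_t n)
    {
        pthread_t *tids = malloc(n * sizeof *tids);
        if (!tids)
            return -1;
        for (size_t i = 0; i < n; i++)
            pthread_create(&tids[i], NULL, worker, &jobs[i]);
        for (size_t i = 0; i < n; i++)
            pthread_join(tids[i], NULL);
        free(tids);
        return 0;
    }

Pointing each job at a slice of one mmap of the file (or at per-segment buffers) keeps the threads from contending on the file handle.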


- "If my observations are correct, each segment has its own arithmetic decoder state, and thus can be decoded in its own thread."

Yeah, but real-world PDF JBIG2s usually seem to have only one segment! One of the first things I checked—they wouldn't have made it that easy, the world's too cruel.

It's sort of a generic problem with compression formats—lots of files could easily be encoded as multiple segments that decompress in parallel, but aren't. If people don't encode them in multiple segments, you can't decompress them in multiple segments. Most formats support something like that in the spec, but most tools either don't implement it or don't have it on by default.

e.g. https://news.ycombinator.com/item?id=33238283 ("pigz: A parallel implementation of gzip for multi-core machines"—fully compatible with the gzip format and with gzip(1)! No one uses it).


Yikes! Doesn't seem like there's anything that can be done to solve that then.

I guess the only way to tackle it would be to target the popular software or libraries for producing PDFs to begin with and try to upstream multi-segment (parallel-decodable) encoding into them.

Or is it possible to "convert" existing PDFs from single-segment to multi-segment PDFs, to make for faster reading on existing software?


Conversion's a very good solution for files you're storing locally! I'm working on polishing a script workflow to implement this—I haven't figured out which format to store things in yet. I don't consider it a full solution to the problem—more of a bandaid/workaround.

The downside is that any PDF conversion is a long-running batch job, one that probably shouldn't be part of any UX sequence—it's way too slow.

Emacs' PDF reader does something like this: when it loads a PDF, its default behavior is to start a background script that converts every page into a PNG, which decodes much more quickly than the image formats typically embedded in PDFs. (You can start reading the PDF right away, and by the end of the conversion it becomes more responsive.) I think it's a questionable design choice: it's a high-CPU task during a UI interaction, and potentially a long-running one for a large PDF. (This is why I was profiling libpng, incidentally.)

https://www.gnu.org/software/emacs/manual/html_node/emacs/Do...


> a very popular format that gets EXTREMELY high compression ratios on bilevel images of types typical to scanned pdfs

Funny you say that, https://en.wikipedia.org/wiki/JBIG2#Character_substitution_e...


I can still remember this cool talk about that. https://www.youtube.com/watch?v=7FeqF1-Z1g0


Question, since you're probably knowledgeable about this right now.

> Another observation: lots of people create PDFs at print-quality pixel density that's useless for screens and greatly increases rendering latency.

Is this relevant to text in the PDF? I would assume text is vectorized, meaning resolution is not relevant until you _actually_ print it?

Or is it just relevant to rasterized content like embedded images?


Your understanding's right: PDFs that are text + fonts are easy and fast. I'm concerned about the other kind: scanned pages. Any sheet music from Petrucci / imslp.org, for one example. That kind is a sequence of raster images, stored in compressed-image formats most people aren't familiar with, because they're specialized to bi-level (1-bit, black and white) images. A separate class from photo-type images. The big two seem to be JBIG2 [0] and CCITT Group 4 [1], which was standardized for fax machines in the 1980s (and still works well!)

[0] https://en.wikipedia.org/wiki/JBIG2

[1] https://en.wikipedia.org/wiki/Fax#Modified_Modified_READ

(You can examine this stuff with pdfimages(1)—or just rg -a for strings like /JBIG2Decode or /CCITTFaxDecode and poke around).


Personally I have the impression that CCITT group 4 compressed PDFs are displayed very quickly, unless they are scanned at 3000 DPI... Can't say the same for JBIG2 or JPEG/JPEG2000 based ones.


> A separate class from photo-type images.

I'd assume that the photo-type image decoder is optimized, right? If so, how does the optimized photo-type decoder compare to the apparently unoptimizable JBIG2 decoder?


I'm not knowledgeable enough to speak to that, but just to clarify—the low-hanging fruit in libpng I mentioned is in simple, vectorizable loops—conversions between pixel formats in buffers. Not in its compression algorithm (which isn't part of libpng—it calls out to zlib for that).


> I know of nowhere to submit this patch.

How about the folks listed as "Authors":

* http://www.libpng.org/pub/png/libpng.html

> Why the heck is everyone using libpng?

What are the alternatives?


sumatrapdf seems better than most at reading them?


Is there any ffmpeg-like command-line program for PDFs? Creating, appending/removing pages, viewing, etc.?


For appending, removing, merging pages, there’s pdftk: https://www.pdflabs.com/tools/pdftk-server/


gs (ghostscript), mutool, ocrmypdf...

To add/remove: mutool merge -h

To split PDF pages: mutool poster -h

I made a script here that I use frequently for scanned documents: https://github.com/chapmanjacobd/computer/blob/main/bin/pdf_...



