- "If my observations are correct, each segment has its own arithmetic decoder state, and thus can be decoded in its own thread."
Yeah, but real-world PDF JBIG2 streams seem to usually have just one segment! One of the first things I checked—they wouldn't have made it that easy, the world's too cruel.
It's sort of a generic problem with compression formats—lots of files could easily be multiple segments that decompress in parallel, but aren't—if people don't encode them in multiple segments, you can't decompress them in multiple segments. Most formats support something like that in the spec, but most tools either don't implement that, or don't have it as the default.
e.g. https://news.ycombinator.com/item?id=33238283 ("pigz: A parallel implementation of gzip for multi-core machines" —fully compatible with the gzip format and with gzip(1)! No one uses it).
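To make the idea concrete, here's a toy Python sketch (the segment scheme and names are made up for illustration; this is not how pigz or JBIG2 actually lay out their blocks). If every segment is compressed independently, decompression parallelizes trivially; one monolithic stream gives you nothing to hand out to other threads.

    # Toy illustration only: compress data as independent zlib segments so
    # each one can be inflated without any state carried over from its neighbours.
    import zlib
    from concurrent.futures import ProcessPoolExecutor

    SEGMENT_SIZE = 1 << 20  # 1 MiB per segment, an arbitrary choice

    def compress_segments(data):
        # Each slice is deflated on its own, so no segment depends on another.
        return [zlib.compress(data[i:i + SEGMENT_SIZE])
                for i in range(0, len(data), SEGMENT_SIZE)]

    def decompress_parallel(segments):
        # Independent segments: every worker can inflate one concurrently.
        with ProcessPoolExecutor() as pool:
            return b"".join(pool.map(zlib.decompress, segments))

    if __name__ == "__main__":
        original = b"some example data " * 500_000
        assert decompress_parallel(compress_segments(original)) == original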
Yikes! Doesn't seem like there's anything that can be done to solve that then.
I guess the only way to tackle it would be to target the popular software or libraries for producing PDFs to begin with and try to upstream parallel encoding into them.
Or is it possible to "convert" existing PDFs from single-segment to multi-segment PDFs, to make for faster reading on existing software?
Conversion's a very good solution for files you're storing locally! I'm working on polishing a script workflow to implement this—I haven't figured out which format to store things in yet. I don't consider it a full solution to the problem—more of a bandaid/workaround.
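Roughly the shape of the batch job I mean, as a sketch (the Ghostscript pdfwrite pass here is just a stand-in; I'm not claiming it emits multi-segment JBIG2—that's exactly the part I haven't settled on):

    # Sketch of the batch job: rewrite every PDF in a directory through a
    # re-encoder.  Ghostscript's pdfwrite device is only a placeholder here;
    # the open question is which encoder produces the friendlier output.
    import pathlib
    import subprocess

    def reencode(src, dst):
        subprocess.run(
            ["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite",
             "-o", str(dst), str(src)],
            check=True)

    out_dir = pathlib.Path("converted")
    out_dir.mkdir(exist_ok=True)
    for pdf in pathlib.Path("originals").glob("*.pdf"):
        reencode(pdf, out_dir / pdf.name)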
The downside is that any PDF conversion is a long-running batch job, one that probably shouldn't be part of any UX sequence—it's way too slow.
Emacs' PDF reader does something like this: when it loads a PDF, its default behavior is to start a background script that converts every page into a PNG, which decodes much more quickly than typical PDF formats. (You can start reading the PDF right away, and once the conversion finishes, it becomes more responsive.) I think it's a questionable design choice: it's a high-CPU task during a UI interaction, and potentially a long-running one, for a large PDF. (This is why I was profiling libpng, incidentally.)
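If you want to play with that approach outside Emacs, something like poppler's pdftoppm gets you the same effect (illustrative only; the resolution and output naming here are arbitrary choices, not what Emacs actually runs):

    # Rough equivalent of that background pass: rasterize every page of a
    # PDF to PNG with poppler's pdftoppm, so a viewer can show the PNGs
    # instead of decoding the PDF's own image streams on every page turn.
    import subprocess
    import sys

    def prerender(pdf_path, out_prefix, dpi=150):
        # Writes out_prefix-1.png, out_prefix-2.png, ... one file per page.
        subprocess.run(
            ["pdftoppm", "-png", "-r", str(dpi), pdf_path, out_prefix],
            check=True)

    if __name__ == "__main__":
        prerender(sys.argv[1], "page")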