I've dealt with image decoding for a game engine before. The images in question were large pixel-art texture atlases, stored as PNG files and loaded at startup. Their slow decoding speed caught me by surprise, given that the file format is 25 years old!
The reason turned out to be that Deflate's design makes it very hard to parallelise or SIMD-accelerate. Even the best available decoders are basically forced to process a byte at a time, in a straight line from start to finish, which obviously isn't a good match for modern CPUs. The 3x to 4x decompression speedup here is nice, but I can't help but feel a little sad about how poorly this format is taking advantage of available compute resources. (The ultimate dream would be a format which is so parallel-friendly that you can just send the binary blob to the GPU and decompress it in a compute shader!)
Even a rule like "each row is compressed separately, with a table of row lengths at the beginning of the file" might have been enough - this would have made compression ratios slightly worse, but complexity wouldn't have suffered too much. With only six different chunk types, we could perhaps imagine a branchless decoder where each row's decoding state is stored in its own SIMD lane, and the results for several rows are all emitted at the same time...
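To make the "table of row lengths" idea concrete, here's a purely hypothetical sketch (nothing like this exists in PNG or QOI): with per-row compressed lengths stored up front, finding the start of any row is just a prefix sum, and each row could then be decoded independently.

    /* Purely hypothetical: with a table of compressed row lengths at the
       start of the file, the start of row i is a prefix sum, and each row
       can be decoded independently (e.g. one per SIMD lane or thread). */
    #include <stddef.h>
    #include <stdint.h>

    size_t row_start_offset(const uint32_t *row_length, size_t data_start, int i)
    {
        size_t off = data_start;
        for (int r = 0; r < i; r++)
            off += row_length[r];
        return off;
    }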
Apple apparently ships a PNG encoder that completely flushes the deflate stream every now and then, giving you additional offsets where you can start decompressing without dependencies on previous data. The offsets are then saved as a separate chunk in the PNG. Decoders aware of this can parallelize decoding using these offsets, everyone else can read it as normal.
The proper solution would of course be to use something more modern than deflate; the PNG format even has a metadata field that specifies which compression algorithm is used. But anything other than deflate wouldn't be backwards compatible, so nobody seems to use that option.
The Apple encoder only splits the PNG in the middle.[0] This was linked from a recent HN post about a hack that exploits this to create PNGs that display differently in some situations on Apple machines.[1]
> The 3x to 4x decompression speedup here is nice, but I can't help but feel a little sad about how poorly this format is taking advantage of available compute resources. (The ultimate dream would be a format which is so parallel-friendly that you can just send the binary blob to the GPU and decompress it in a compute shader!)
You can't usefully decompress the entire thing on the GPU, but you can do half of it there. You can decompress DEFLATE right before sending the data over the bus and do the PNG prediction (filtering) in a compute shader. I actually implemented this once (PNG prediction is amenable to wavefront parallelism) and it worked fine, though it wasn't any faster than using SIMD on the CPU because the fact that each row can specify a different prediction method means that you end up with a lot of divergence in the shader. Still, it could get some load off the CPU if your loading is CPU-bound.
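For anyone curious where the divergence comes from: every PNG row independently picks one of five filter types, and the fanciest of them is the Paeth predictor, so neighbouring shader lanes easily end up on different branches. A minimal sketch of the predictor (my own C rendering of the algorithm in the PNG spec, not my shader code):

    #include <stdlib.h>

    /* Paeth predictor: a = left, b = above, c = upper-left.
       Each PNG row chooses one of five filters (None, Sub, Up, Average,
       Paeth), which is what causes the per-row divergence mentioned above. */
    static unsigned char paeth(unsigned char a, unsigned char b, unsigned char c)
    {
        int p  = (int)a + (int)b - (int)c;
        int pa = abs(p - (int)a);
        int pb = abs(p - (int)b);
        int pc = abs(p - (int)c);
        if (pa <= pb && pa <= pc) return a;
        if (pb <= pc)             return b;
        return c;
    }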
Reading that blog post, I was surprised to learn that many modern games dedicate more than half of their filesize to textures. I haven't played an AAA game in more than a decade, but I would have thought that meshes and (particularly) audio would use up more space.
It sounds like developers are stuck between a rock and a hard place. They need one of the specific compressed pixel formats which can be efficiently read by the GPU, but those formats are about ten times larger than a similar JPEG file, they don't losslessly compress well (modulo RAD Game Tools' discoveries here), and recompressing raw pixels to a GPU-addressable format at load time would be orders-of-magnitude too slow.
RAD Game Tools' approach here is clever, but it feels like a bit of a hack. The obvious next step would be a lossy compressed image format which can decompress directly to BC7, exploiting spatial-correlation tricks similar to those which PNG uses to get better results than a gzipped BMP file. Has anybody tried this already?
I wouldn't call Oodle Texture a hack. But there's also https://github.com/BinomialLLC/crunch and https://github.com/BinomialLLC/basis_universal. The latter has the advantage that it can be decoded to multiple different compressed texture formats, so that all GPUs can be supported from a single file (different GPUs support different formats so you can't necessarily send the same compressed texture to any GPU).
Looking at the benchmark results and image sizes, it seems like this format is good at images that have horizontal structures, or where the same color is repeated often. It's not good at vertical structures, including images with enlarged pixels (where one row is identical to the one above). I think the main reason for that is that PNG has a filter pass that can look at the row of pixels above, while this format only looks to the left.
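For comparison, PNG's "Up" filter is the one that captures vertical structure; a minimal sketch of its reconstruction step (my own rendering of the spec, not reference code):

    #include <stddef.h>

    /* PNG "Up" filter reconstruction: each byte is a delta against the byte
       directly above it, which is exactly what QOI's left-only ops lack. */
    void unfilter_up(unsigned char *row, const unsigned char *prev_row, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            row[i] = (unsigned char)(row[i] + prev_row[i]);  /* mod-256 add */
    }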
One thing worth noting is that QOI can be pretty easily wrapped with something else to change pixel traversal order. If you want higher compression, I'd recommend wrapping QOI with ZSTD (or similar) compression on the output, and something to change the pixel traversal order on the input.
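As a sketch of what "change the pixel traversal order" could mean (this transpose wrapper is my own idea, not part of QOI): feed the encoder a transposed copy of the image so vertically repetitive content looks horizontal to it, then QOI-encode and optionally ZSTD-compress the result.

    #include <string.h>

    /* Hypothetical pre-pass: transpose RGBA pixels so QOI walks what were
       originally columns. The transposed image is height x width. */
    void transpose_rgba(const unsigned char *src, unsigned char *dst,
                        int width, int height)
    {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                memcpy(&dst[(x * height + y) * 4],
                       &src[(y * width + x) * 4], 4);
    }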
This is going to be super useful for educational purposes. I'd really like to see more file formats like this.
There's already a bunch of minimal programming languages and even some operating systems. It's great value for learning stuff when you can implement a quite OK version of something in dozens of hours or less.
Why? If the goal is to be fast and simple then surely little endian would make more sense given that basically all processors are little endian. Big endian just means you need to swap everything twice.
Well, it's easier to read in a hex editor and seemed like the right thing to do for a "wire format".
The only places where it matters are the header's width/height and the order of RGB(A) data. Storing RGBA as ABGR is not a great choice, imho. If you want to be endian independent, you have to read one byte at a time anyway (or swap).
I don't believe it would imply that, because those are independent elements. RGBA would still be stored as R, G, B, A unless you're considering each pixel as a single 32-bit integer. A C structure for the colors would even live in memory in the order you expect. Just like a string is still stored byte-by-byte even in little-endian contexts.
LE vs BE would only affect multi-byte integers and floats, not arrays or structures of single-byte elements.
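Concretely, reading the header's big-endian fields one byte at a time is already endian-independent, for example:

    #include <stdint.h>

    /* Read a 32-bit big-endian value (e.g. the header's width/height)
       without caring about the host's endianness. */
    uint32_t read_be32(const unsigned char *p)
    {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
    }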
Because almost everywhere you would use data will expect it as interleaved. Separating the channels would require a second pass to re-interleave the data.
Support for 16/32-bit and floating-point formats would've been nice -- the compressor is actually orthogonal to the bit width of the channels. Custom floating-point compression usually just involves moving the sign bit into the LSB, and delta encoding the mantissa separately from the exponent -- although the latter isn't as important.
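The sign-bit part of that trick is simple enough to sketch (assuming IEEE-754 binary32; the separate mantissa/exponent delta coding is left out):

    #include <stdint.h>
    #include <string.h>

    /* Rotate the sign bit into the LSB so that small positive and negative
       floats map to nearby integers, which delta-codes better. */
    uint32_t float_bits_for_coding(float f)
    {
        uint32_t u;
        memcpy(&u, &f, sizeof u);       /* bit-cast without UB */
        return (u << 1) | (u >> 31);    /* sign bit -> bit 0 */
    }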
It was pretty good, but in the end I don't think I'll be making it part of the graphics system. I would really like a bit of random access into the image without having to divide it up into multiple tiles. And the spec took a direction that didn't really suit me - I wanted to use it for GUI-like textures which have slabs of colour and anti-aliased edges with varying alpha. In the final version of the spec there's no way to code "the alpha changed, but the colour is the same" without spending 5 bytes on a single pixel. Previously that took 2 bytes.
So I'm still looking for a format with:
* Very small amount of decode state (GIF's 16k is probably too much - PNG's 32k is certainly too much).
* Fast decode (these embedded chips are not as fast as desktops).
* Alpha channel (not just on-off like GIF)
* A bit of random access like the parallel PNG, but hopefully less bolted-on.
Have you considered compressed texture formats, perhaps DXT5? Fixed compression ratio (1 byte per pixel for DXT5), arbitrary random access, decompress the pixels you need on the fly. Can be further compressed for efficient storage to about 1-2 bits per pixel with https://github.com/BinomialLLC/crunch.
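The random access really is just arithmetic; as a sketch (assuming DXT5/BC3's fixed 16 bytes per 4x4 block):

    #include <stddef.h>

    /* Byte offset of the DXT5/BC3 block containing pixel (x, y); each 4x4
       block is a fixed 16 bytes, so no index or prior decoding is needed. */
    size_t dxt5_block_offset(int x, int y, int image_width)
    {
        int blocks_per_row = (image_width + 3) / 4;
        return (size_t)((y / 4) * blocks_per_row + (x / 4)) * 16;
    }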
This is pretty neat. Simple enough to decode, and very simple to do random access. And if we ever port to something with a GPU it might even be supported. But I worry it might be too visibly lossy, especially for GUI-ish textures.
It's certainly not going to be perceptually lossless for every image, but good compressors can do pretty well. There's a big difference in quality between good compressors and bad ones. Newer formats like ASTC have better quality but are more complex. ASTC has the advantage of selectable compression ratio from 8 bits per pixel down to 0.89 bits per pixel.
All these points are spot on, but I want to reorder the priority. I cannot believe that a format introduced in this millennium gives no thought at all to aiding parallel access/decode. The single-thread focus will not age well; even low-end microprocessors are going multicore today, and going forward that will only accelerate.
Even ignoring parallelism, images increasingly need to go straight to the GPU, so improving the performance of access on-GPU (i.e. using a GPU compressed texture format) can have a bigger positive impact on the user experience than making the file on disk smaller. Adopting Basis to compress your images could in practice be much better than adopting QOI, and QOI should be benchmarked against options like that along with PNG or JPEG.
The first thing that comes to mind is that it's not very amenable to parallel encoding or decoding, even less so than PNG. The fact that QOI_OP_INDEX could reach arbitrarily far back basically eliminates parallel decompression. But I guess the simplest workaround for that would be to just break up your images into tiles that are each decoded in parallel. You could imagine a simple wrapper format that specifies how to reconstruct a larger image out of such tiles, which would be amenable to parallelism (albeit still not easily implementable on GPU, though I guess you could wring a little bit of parallelism out of it by using one GPU SIMD lane per RGBA channel).
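A purely hypothetical tile index for such a wrapper (nothing like this exists in QOI itself) might be as simple as:

    #include <stdint.h>

    /* Hypothetical wrapper: each tile is an independent QOI stream, so tiles
       can be handed to separate threads for decoding. */
    struct qoi_tile_index {
        uint32_t tile_width, tile_height;  /* tile size in pixels */
        uint32_t tiles_x, tiles_y;         /* grid dimensions */
        uint64_t tile_offset[];            /* byte offset of each tile's
                                              QOI stream within the file */
    };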
Of course, it's entirely possible that your decompression goes so fast it doesn't matter that it's sequential and you're bottlenecked on I/O in any reasonable use case…
On my laptop, QOI encodes a test corpus at about 78 MP/s to a density of about 8.0 bits per pixel. Fast png encoding with fpng is 71 MP/s and reaches about 7.8 bpp.
Fast lossless jxl encoding can be done at 162 MP/s (single-core, faster with more cores) and reaches about 6.8 bpp. It can also be decoded in parallel, unlike QOI and PNG. It doesn't have a single-page spec though :)
I honestly quite like this. Was looking for a quick way to store and transmit some small images both with and without alpha information without having to bring in a relatively large dependency like libpng and still have a modicum of compression. Having the whole thing in a single file without any external dependencies is quite cool.
I especially appreciate that the spec is concise (yet clear) enough to fit on a single page and I was able to code a full encoder/decoder that generated bit-for-bit identical results to the reference implementation in just over an hour just by reading that spec.
What exactly do you mean with "edge cases"? For what it's worth, the decoder has been fuzzed[1], and the benchmark suite[2] of 2800 images encodes and decodes without losing any pixels, as checked by qoibench[3].
TL;DR: hash collisions are an explicit part of the format. Encoding and decoding are both specified to use the same hash function, so the hash collisions all cancel out.
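For reference, the index position from the QOI spec (adapted here from the reference implementation's macro, which takes a pixel struct rather than separate channels):

    /* Both encoder and decoder compute this the same way, so a collision just
       means an index slot gets overwritten; the encoder only emits
       QOI_OP_INDEX when the stored pixel actually matches the current one. */
    #define QOI_COLOR_HASH(r, g, b, a) \
        (((r) * 3 + (g) * 5 + (b) * 7 + (a) * 11) % 64)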