I've dealt with image decoding for a game engine before. The images in question were large pixel-art texture atlases, stored as PNG files and loaded at startup. Their slow decoding speed caught me by surprise, given that the file format is 25 years old!
The reason turned out to be that Deflate's design makes it very hard to parallelise or SIMD-accelerate. Even the best available decoders are basically forced to process a byte at a time, in a straight line from start to finish, which obviously isn't a good match for modern CPUs. The 3x to 4x decompression speedup here is nice, but I can't help but feel a little sad about how poorly this format is taking advantage of available compute resources. (The ultimate dream would be a format which is so parallel-friendly that you can just send the binary blob to the GPU and decompress it in a compute shader!)
Even a rule like "each row is compressed separately, with a table of row lengths at the beginning of the file" might have been enough - this would have made compression ratios slightly worse, but complexity wouldn't have suffered too much. With only six different chunk types, we could perhaps imagine a branchless decoder where each row's decoding state is stored in its own SIMD lane, and the results for several rows are all emitted at the same time...
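To make the "table of row lengths" idea concrete, here's a purely hypothetical sketch (nothing like this exists in PNG or QOI): with per-row compressed lengths stored up front, finding the start of any row is just a prefix sum, and each row could then be decoded independently.

    /* Purely hypothetical: with a table of compressed row lengths at the
       start of the file, the start of row i is a prefix sum, and each row
       can be decoded independently (e.g. one per SIMD lane or thread). */
    #include <stddef.h>
    #include <stdint.h>

    size_t row_start_offset(const uint32_t *row_length, size_t data_start, int i)
    {
        size_t off = data_start;
        for (int r = 0; r < i; r++)
            off += row_length[r];
        return off;
    }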
Apple apparently ships a PNG encoder that completely flushes the deflate stream every now and then, giving you additional offsets where you can start decompressing without dependencies on previous data. The offsets are then saved as a separate chunk in the PNG. Decoders aware of this can parallelize decoding using these offsets, everyone else can read it as normal.
The proper solution would of course be to use something more modern than deflate; the PNG format even has a metadata field that specifies which compression algorithm is used. But anything other than deflate wouldn't be backwards compatible, so nobody seems to use that option.
The Apple encoder only splits the PNG in the middle.[0] This was linked from a recent HN post about a hack that exploits this to create PNGs that display differently in some situations on Apple machines.[1]
> The 3x to 4x decompression speedup here is nice, but I can't help but feel a little sad about how poorly this format is taking advantage of available compute resources. (The ultimate dream would be a format which is so parallel-friendly that you can just send the binary blob to the GPU and decompress it in a compute shader!)
You can't usefully decompress the entire thing on the GPU, but you can do half of it there. You can decompress DEFLATE right before sending the data over the bus and do the PNG prediction (filtering) in a compute shader. I actually implemented this once (PNG prediction is amenable to wavefront parallelism) and it worked fine, though it wasn't any faster than using SIMD on the CPU because the fact that each row can specify a different prediction method means that you end up with a lot of divergence in the shader. Still, it could get some load off the CPU if your loading is CPU-bound.
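For anyone curious where the divergence comes from: every PNG row independently picks one of five filter types, and the fanciest of them is the Paeth predictor, so neighbouring shader lanes easily end up on different branches. A minimal sketch of the predictor (my own C rendering of the algorithm in the PNG spec, not my shader code):

    #include <stdlib.h>

    /* Paeth predictor: a = left, b = above, c = upper-left.
       Each PNG row chooses one of five filters (None, Sub, Up, Average,
       Paeth), which is what causes the per-row divergence mentioned above. */
    static unsigned char paeth(unsigned char a, unsigned char b, unsigned char c)
    {
        int p  = (int)a + (int)b - (int)c;
        int pa = abs(p - (int)a);
        int pb = abs(p - (int)b);
        int pc = abs(p - (int)c);
        if (pa <= pb && pa <= pc) return a;
        if (pb <= pc)             return b;
        return c;
    }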
Reading that blog post, I was surprised to learn that many modern games dedicate more than half of their filesize to textures. I haven't played an AAA game in more than a decade, but I would have thought that meshes and (particularly) audio would use up more space.
It sounds like developers are stuck between a rock and a hard place. They need one of the specific compressed pixel formats which can be efficiently read by the GPU, but those formats are about ten times larger than a similar JPEG file, they don't losslessly compress well (modulo RAD Game Tools' discoveries here), and recompressing raw pixels to a GPU-addressable format at load time would be orders-of-magnitude too slow.
RAD Game Tools' approach here is clever, but it feels like a bit of a hack. The obvious next step would be a lossy compressed image format which can decompress directly to BC7, exploiting spatial-correlation tricks similar to those which PNG uses to get better results than a gzipped BMP file. Has anybody tried this already?
I wouldn't call Oodle Texture a hack. But there's also https://github.com/BinomialLLC/crunch and https://github.com/BinomialLLC/basis_universal. The latter has the advantage that it can be decoded to multiple different compressed texture formats, so that all GPUs can be supported from a single file (different GPUs support different formats so you can't necessarily send the same compressed texture to any GPU).
Looking at the benchmark results and image sizes, it seems like this format is good at images that have horizontal structures, or where the same color is repeated often. It's not good at vertical structures, including images with enlarged pixels (where one row is identical to the one above). I think the main reason for that is that PNG has a filter pass that can look at the row of pixels above, while this format only looks to the left.
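For comparison, PNG's "Up" filter is the one that captures vertical structure; a minimal sketch of its reconstruction step (my own rendering of the spec, not reference code):

    #include <stddef.h>

    /* PNG "Up" filter reconstruction: each byte is a delta against the byte
       directly above it, which is exactly what QOI's left-only ops lack. */
    void unfilter_up(unsigned char *row, const unsigned char *prev_row, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            row[i] = (unsigned char)(row[i] + prev_row[i]);  /* mod-256 add */
    }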
One thing worth noting is that QOI can be pretty easily wrapped with something else to change pixel traversal order. If you want higher compression, I'd recommend wrapping QOI with ZSTD (or similar) compression on the output, and something to change the pixel traversal order on the input.
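As a sketch of what "change the pixel traversal order" could mean (this transpose wrapper is my own idea, not part of QOI): feed the encoder a transposed copy of the image so vertically repetitive content looks horizontal to it, then QOI-encode and optionally ZSTD-compress the result.

    #include <string.h>

    /* Hypothetical pre-pass: transpose RGBA pixels so QOI walks what were
       originally columns. The transposed image is height x width. */
    void transpose_rgba(const unsigned char *src, unsigned char *dst,
                        int width, int height)
    {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                memcpy(&dst[(x * height + y) * 4],
                       &src[(y * width + x) * 4], 4);
    }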
This is going to be super useful for educational purposes. I'd really like to see more file formats like this.
There's already a bunch of minimal programming languages and even some operating systems. It's great value for learning stuff when you can implement a quite OK version of something in dozens of hours or less.
Why? If the goal is to be fast and simple then surely little endian would make more sense given that basically all processors are little endian. Big endian just means you need to swap everything twice.
Well, it's easier to read in a hex editor and seemed like the right thing to do for a "wire format".
The only places where it matters are the header's width/height and the order of RGB(A) data. Storing RGBA as ABGR is not a great choice, imho. If you want to be endian independent, you have to read one byte at a time anyway (or swap).
I don't believe it would imply that, because those are independent elements. RGBA would still be stored as R, G, B, A unless you're considering each pixel as a single 32-bit integer. A C structure for the colors would even live in memory in the order you expect. Just like a string is still stored byte-by-byte even in little-endian contexts.
LE vs BE would only affect multi-byte integers and floats, not arrays or structures of single-byte elements.
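Concretely, reading the header's big-endian fields one byte at a time is already endian-independent, for example:

    #include <stdint.h>

    /* Read a 32-bit big-endian value (e.g. the header's width/height)
       without caring about the host's endianness. */
    uint32_t read_be32(const unsigned char *p)
    {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
    }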
Because almost everywhere you would use data will expect it as interleaved. Separating the channels would require a second pass to re-interleave the data.
Support for 16/32-bit and floating-point formats would've been nice -- the compressor is actually orthogonal to the bit width of the channels. Custom floating-point compression usually just involves moving the sign bit into the LSB, and delta encoding the mantissa separately from the exponent -- although the latter isn't as important.
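The sign-bit part of that trick is simple enough to sketch (assuming IEEE-754 binary32; the separate mantissa/exponent delta coding is left out):

    #include <stdint.h>
    #include <string.h>

    /* Rotate the sign bit into the LSB so that small positive and negative
       floats map to nearby integers, which delta-codes better. */
    uint32_t float_bits_for_coding(float f)
    {
        uint32_t u;
        memcpy(&u, &f, sizeof u);       /* bit-cast without UB */
        return (u << 1) | (u >> 31);    /* sign bit -> bit 0 */
    }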
It was pretty good, but in the end I don't think I'll be making it part of the graphics system. I would really like a bit of random access into the image without having to divide it up into multiple tiles. And the spec took a direction that didn't really suit me - I wanted to use it for GUI-like textures which have slabs of colour and anti-aliased edges with varying alpha. In the final version of the spec there's no way to code "the alpha changed, but the colour is the same" without spending 5 bytes on a single pixel. Previously that took 2 bytes.
So I'm still looking for a format with:
* Very small amount of decode state (GIF's 16k is probably too much - PNG's 32k is certainly too much).
* Fast decode (these embedded chips are not as fast as desktops).
* Alpha channel (not just on-off like GIF)
* A bit of random access like the parallel PNG, but hopefully less bolted-on.
Have you considered compressed texture formats, perhaps DXT5? Fixed compression ratio (1 byte per pixel for DXT5), arbitrary random access, decompress the pixels you need on the fly. Can be further compressed for efficient storage to about 1-2 bits per pixel with https://github.com/BinomialLLC/crunch.
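The random access really is just arithmetic; as a sketch (assuming DXT5/BC3's fixed 16 bytes per 4x4 block):

    #include <stddef.h>

    /* Byte offset of the DXT5/BC3 block containing pixel (x, y); each 4x4
       block is a fixed 16 bytes, so no index or prior decoding is needed. */
    size_t dxt5_block_offset(int x, int y, int image_width)
    {
        int blocks_per_row = (image_width + 3) / 4;
        return (size_t)((y / 4) * blocks_per_row + (x / 4)) * 16;
    }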
This is pretty neat. Simple enough to decode, and very simple to do random access. And if we ever port to something with a GPU it might even be supported. But I worry it might be too visibly lossy, especially for GUI-ish textures.
It's certainly not going to be perceptually lossless for every image, but good compressors can do pretty well. There's a big difference in quality between good compressors and bad ones. Newer formats like ASTC have better quality but are more complex. ASTC has the advantage of selectable compression ratio from 8 bits per pixel down to 0.89 bits per pixel.
All these points are spot on, but I want to reorder the priority. I cannot believe that a format introduced in this millennium gives no thought at all to aiding parallel access/decode. The single-thread focus will not age well; even low-end microprocessors are going multicore today, and going forward that will only accelerate.
Even ignoring parallelism, images increasingly need to go straight to the GPU, so improving the performance of access on-GPU (i.e. using a GPU compressed texture format) can have a bigger positive impact on the user experience than making the file on disk smaller. Adopting Basis to compress your images could in practice be much better than adopting QOI, and QOI should be benchmarked against options like that along with PNG or JPEG.
The first thing that comes to mind is that it's not very amenable to parallel encoding or decoding, even less so than PNG. The fact that QOI_OP_INDEX could reach arbitrarily far back basically eliminates parallel decompression. But I guess the simplest workaround for that would be to just break up your images into tiles that are each decoded in parallel. You could imagine a simple wrapper format that specifies how to reconstruct a larger image out of such tiles, which would be amenable to parallelism (albeit still not easily implementable on GPU, though I guess you could wring a little bit of parallelism out of it by using one GPU SIMD lane per RGBA channel).
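A purely hypothetical tile index for such a wrapper (nothing like this exists in QOI itself) might be as simple as:

    #include <stdint.h>

    /* Hypothetical wrapper: each tile is an independent QOI stream, so tiles
       can be handed to separate threads for decoding. */
    struct qoi_tile_index {
        uint32_t tile_width, tile_height;  /* tile size in pixels */
        uint32_t tiles_x, tiles_y;         /* grid dimensions */
        uint64_t tile_offset[];            /* byte offset of each tile's
                                              QOI stream within the file */
    };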
Of course, it's entirely possible that your decompression goes so fast it doesn't matter that it's sequential and you're bottlenecked on I/O in any reasonable use case…
On my laptop, QOI encodes a test corpus at about 78 MP/s to a density of about 8.0 bits per pixel. Fast png encoding with fpng is 71 MP/s and reaches about 7.8 bpp.
Fast lossless jxl encoding can be done at 162 MP/s (single-core, faster with more cores) and reaches about 6.8 bpp. It can also be decoded in parallel, unlike QOI and PNG. It doesn't have a single-page spec though :)
I honestly quite like this. Was looking for a quick way to store and transmit some small images both with and without alpha information without having to bring in a relatively large dependency like libpng and still have a modicum of compression. Having the whole thing in a single file without any external dependencies is quite cool.
I especially appreciate that the spec is concise (yet clear) enough to fit on a single page and I was able to code a full encoder/decoder that generated bit-for-bit identical results to the reference implementation in just over an hour just by reading that spec.
What exactly do you mean with "edge cases"? For what it's worth, the decoder has been fuzzed[1], and the benchmark suite[2] of 2800 images encodes and decodes without losing any pixels, as checked by qoibench[3].
TL;DR: hash collisions are an explicit part of the format. Encoding and decoding are both specified to use the same hash function, so the hash collisions all cancel out.
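For reference, the index position from the QOI spec (adapted here from the reference implementation's macro, which takes a pixel struct rather than separate channels):

    /* Both encoder and decoder compute this the same way, so a collision just
       means an index slot gets overwritten; the encoder only emits
       QOI_OP_INDEX when the stored pixel actually matches the current one. */
    #define QOI_COLOR_HASH(r, g, b, a) \
        (((r) * 3 + (g) * 5 + (b) * 7 + (a) * 11) % 64)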