> Unfortunately, the reference code is more than 12.000 lines of unrolled C code. This code is basically duplicated without a thought in every single project.
Not by me. Overly bulky code is one of my pet peeves[1]. Here is my implementation of HQ2x in 6KB, 188 lines of code (not including the header file, which is 1KB and used for external linking.)
I devised a few observations to reduce the code. First is that the 256 patterns have four cases for each corner. If you rotate the eight surrounding pixels twice, the case blending rules match another corner (sans two cases which appear to be typos in the original algorithm. I asked the author, he apparently made the entire unrolled table by hand.) Now that the algorithm is cut down by 75%, put the switch statement into a lookup table.
To speed up the algorithm (and which further reduces its size), I use an old 16<>32-bit expand/compact trick, a YUV lookup table, and a really clever mask compare that blargg came up with. At this point, my version is already significant faster than the original algorithm. But I also added in OpenMP parallelization support, which really makes things run fast on multicore systems.
But anyway ... this guy clearly bested me. 560 lines with HQ3x and HQ4x included is even more impressive. Hats off to him!
([1] I also have inflate in 8kb of code, unzip in (inflate+2KB) of code, PNG decompression in (inflate+8kb) of code, SHA256 hashing in 4KB of code, an SMTP client with authentication+MIME attachments+HTML in 8KB of code, an XML parser (sans DTDs) in 6KB of code, etc.)
It was a mostly manual process. I was trying to understand why four rules (one for each output pixel) were needed inside every case statement. So I wrote a parser for the original source file that turned each output pixel rule into an integer representing what the code for it was actually doing.
From here I noticed the strong symmetry when you rotated the pixels. Which of course makes perfect sense, I was just able to confirm it here.
So I tried to reproduce the same tables using a LUT and rotate table, and matched all but two instances. This seemed really weird to me, since I figured this would have to be generated unrolled code, right? So I asked Maxim about it, and he said that they were probably mistakes. I was pretty shocked myself that he did this by hand.
I did a few blind comparison photos on my board to see if anyone could tell the difference (in practice it amounted to one or two pixel differences in each 256x224 upscaled image), and everyone actually favored mine, so I just went ahead and ignored those two differences.
I'm curious if the blog author noticed the same thing, or if he noticed the rotation quicker and only built his tables off of the top-left output rules.
Ultimately it looks like the blog author found a better route to reducing the comparison rules than I did. But I still don't think we've 100% cracked how and why the original author chose the rules he did. And since it sounds like it was done almost by sight, we probably never will.
Still, I'd really like to see if someone could find a way to generate the comparison rules mathematically. It's very likely that if we had some sort of scientific basis for when and why to blend each case, we could improve the visual appearance beyond that of the original algorithm. Obviously perfection doesn't exist. I'd expect the end result to be a function that takes tuning variables, and different input images would look better with different tuning values. But even for the default case, I bet we could do better.
Well, I just assumed it was symmetrical because I couldn't imagine it to be unrolled manually (I just quickly watched the visual representations of each side)... But now that you pointed out this I guess I'm going to check for differences, in all the filters. Do you remember on which side the typo was present?
And yeah, we still haven't really cracked the algorithm. I initially wanted to write the article as "understanding hqx", but I actually realized I still don't really get the rules.
Could you share the board post where you discuss your implementation?
Please understand that I wrote this code around late 2005, early 2006.
Unfortunately the discussion thread where I talked about this initially (on the ZSNES forums) has been deleted, or is for some reason not showing up in Google search results.
But you can see how hazy my memory is. There I said there seemed to be six differences, but was already unsure of the exact number.
I am unable to find any posts where I mentioned what the exact errors were, nor where I spoke with Maxim about them. But if you really want to find out, it should only take a few minutes to write a parser for hq2x.c, generate the tables using my rotation approach, and compare the two. If you do so, please post which rules are different so I can file the info away somewhere.
Although I will say it's not too terribly important. The difference is extremely minimal.
"sans two cases which appear to be typos in the original algorithm."
By chance, do you have more details on which two specific cases are 'typos'? It would be fun to try fixing them in the original code and see what kind of change occurs in the output.
I posted this below to ux, but I'll repeat it here so you see it.
Sorry, but I was unable to find which ones were at issue. I wrote this code in 2005 - 2006, and have forgotten the relevant details by this point.
The info on how I did the symmetry and rotation was posted here, where it looks like I said there were six differences, but was still unsure of the exact number: http://forums.nesdev.com/viewtopic.php?p=82794#p82794
I'm a bit too swamped to do it myself, but if you or anyone else would be up for repeating my steps to determine which rules have the symmetry issues, please do post here.
Thanks for the extra reply effort, didn't mean to make you do extra work. I'll keep my eyes open and see if I can work it out (but I'm way behind ux, who's more likely than I within any reasonable time).
And find the relevant code inside the nall/ subfolder. Note that the code inside nall/ (and libco/, ruby/, phoenix/) is public domain, or ISC if you prefer. The rest is GPLv3.
Not by me. Overly bulky code is one of my pet peeves[1]. Here is my implementation of HQ2x in 6KB, 188 lines of code (not including the header file, which is 1KB and used for external linking.)
http://pastebin.com/raw.php?i=DsNupdbc
I devised a few observations to reduce the code. First is that the 256 patterns have four cases for each corner. If you rotate the eight surrounding pixels twice, the case blending rules match another corner (sans two cases which appear to be typos in the original algorithm. I asked the author, he apparently made the entire unrolled table by hand.) Now that the algorithm is cut down by 75%, put the switch statement into a lookup table.
To speed up the algorithm (and which further reduces its size), I use an old 16<>32-bit expand/compact trick, a YUV lookup table, and a really clever mask compare that blargg came up with. At this point, my version is already significant faster than the original algorithm. But I also added in OpenMP parallelization support, which really makes things run fast on multicore systems.
But anyway ... this guy clearly bested me. 560 lines with HQ3x and HQ4x included is even more impressive. Hats off to him!
([1] I also have inflate in 8kb of code, unzip in (inflate+2KB) of code, PNG decompression in (inflate+8kb) of code, SHA256 hashing in 4KB of code, an SMTP client with authentication+MIME attachments+HTML in 8KB of code, an XML parser (sans DTDs) in 6KB of code, etc.)