"Traditionally GDAL has treated this flag as having no relevance to the georeferencing of the image despite disputes from a variety of other software developers and data producers. This was based on the authors interpretation of something said once by the GeoTIFF author. However, a recent review of section [section 2.5.2.2] of the GeoTIFF specificaiton has made it clear that GDAL behavior is incorrect and that PixelIsPoint georeferencing needs to be offset by a half a pixel when transformed to the GDAL georeferencing model."
Placing pixel centers at 0.5, 0.5 is only the obvious choice if you think pixels are little squares rather than point samples. Pixels-as-squares makes intuitive sense to people, especially those raised on pixel art, but it's just one possible choice you can make for your reconstruction function. It's not even a particularly good choice. It doesn't model real sensors or real displays, and it doesn't have particularly nice theoretical properties. The only thing going for it is that it's cheaper to compute some things.
It’s true that pixels aren’t most accurately modeled as squares, but they should still be centered at (0.5, 0.5), because you want the center of mass of a W×H pixel image to be at exactly (W/2, H/2) no matter what shape the pixels are. Otherwise it shifts around when you resize the image—perhaps even by much more than 1 pixel if you resize it by a large factor.
Unfortunately that doesn't make sense when you need to look up pixel (0.5, 0.5) in the framebuffer.
When dealing with cameras, the central point is rarely (h/2, w/2). So you're really dealing with two sets of coordinates, camera coordinates and sensor coordinates, that need to be converted between.
Integer coordinates are convenient for accessing the sensor pixels, and the camera-to-sensor space transform should theoretically account for the (0.5, 0.5) offset. However, getting a calibration to within 0.5-pixel accuracy is going to be hard to begin with.
Nobody’s suggesting that pixels are stored at half-integer memory addresses. After all, only a small subset of the continuous image space will lie exactly on the grid of pixel centers—and this is true no matter how the grid is offset. The point is that the grid should be considered as being lined up with (0.5, 0.5) rather than with (0, 0).
So, for example, if you’re scaling an image up by 10× with bilinear interpolation, and you need to figure out what to store at address (7, 23) in the output framebuffer, you should convert that to continuous coordinates (7.5, 23.5), scale these continuous coordinates down to (0.75, 2.35), and use that to take the appropriate weighted average of the surrounding input pixels centered at (0.5, 1.5), (1.5, 1.5), (0.5, 2.5), and (1.5, 2.5), which are located at addresses (0, 1), (1, 1), (0, 2), and (1, 2) in the input framebuffer. The result will be different and visually more correct than if you had done the computation without taking the (0.5, 0.5) offset into account. In this case the naive computation would instead give you a combination of the pixels at (0, 2), (1, 2), (0, 3), and (1, 3) in the input framebuffer, and the result would appear to be shifted by a subpixel offset. This was essentially the cause of a GIMP bug that I reported in 2009: https://bugzilla.gnome.org/show_bug.cgi?id=592628.
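To make this concrete, here's a minimal sketch of such an upscaler in plain Python (the function name and structure are mine, not from any particular library); plugging output address (7, 23) in with scale 10 reproduces the numbers above:

    def upscale_bilinear(src, src_w, src_h, scale):
        # Upscale a grayscale image (row-major list of floats) by an integer
        # factor, treating pixel centers as sitting at half-integer coordinates.
        dst_w, dst_h = src_w * scale, src_h * scale
        dst = [0.0] * (dst_w * dst_h)
        for y in range(dst_h):
            for x in range(dst_w):
                # Output address -> continuous coordinates of its pixel center,
                # mapped into the input's continuous coordinate system.
                sx, sy = (x + 0.5) / scale, (y + 0.5) / scale
                # Addresses of the input pixels whose centers surround (sx, sy).
                x0 = min(max(int(sx - 0.5), 0), src_w - 1)
                y0 = min(max(int(sy - 0.5), 0), src_h - 1)
                x1, y1 = min(x0 + 1, src_w - 1), min(y0 + 1, src_h - 1)
                # Weights measured from the input center at (x0 + 0.5, y0 + 0.5).
                fx = min(max(sx - (x0 + 0.5), 0.0), 1.0)
                fy = min(max(sy - (y0 + 0.5), 0.0), 1.0)
                top = src[y0 * src_w + x0] * (1 - fx) + src[y0 * src_w + x1] * fx
                bot = src[y1 * src_w + x0] * (1 - fx) + src[y1 * src_w + x1] * fx
                dst[y * dst_w + x] = top * (1 - fy) + bot * fy
        return dst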
If your reconstruction function is symmetric, which it should be, then your pixel centers are at 0.5, 0.5. All commonly used resampling algorithms are symmetric, except perhaps truncating implementations of point sampling (which are thus inconsistent with the rest and arguably wrong for this reason). It doesn't matter how you choose to reconstruct, samples should still be centered around their reconstruction function.
Otherwise, you're shifting the image around every time you reconstruct. This causes errors and can be very hard to reason about.
I have mixed feelings about this memo. It's right about practical aspects of resampling filters, but tries too hard to justify that with sampling theory. For example, pixel-aligned sharp edges exist and are meaningful in images, unlike perfectly square waves in sampling theory.
There's no problem with viewing an image as a wave; however, if you do so you also need to accept that it contains frequency components well beyond the Nyquist frequency, so the sampling theorem will be of no help in reconstructing the image.
Still, lots of operations on images can be stated in terms of convolution (or a regularized inverse of it), so it's not like Fourier analysis is entirely useless.
Pixel-aligned sharp edges do not actually exist and are not meaningful in images, because a perfectly sharp lens does not exist (and cannot exist), so you can never form a sharp edge in an image. Defocus also prevents you from doing so, and a lens with a wider depth of field has an immediately noticeable limit in sharpness.
Even if you were somehow able to create a perfect lens, you would not be able to create a perfectly sharp edge with real world objects.
See, here you're committing the same error that the paper does: pretending pixels are all about photography and optics, and ignoring that some computer-generated graphics actually are supposed to represent perfect squares.
Sometimes, pixels really are little squares. Not always, but not never, either.
I disagree. Would you say a specific address in an array has edges? No, because it's an address, a fixed point. Even if that array accidentally described a square, the individual array addresses would have nothing to do with that.
It's not a question of the representation, it's a question of quanta. Pixels are data, the little squares are the artifacts your LCD eventually produces with the help of that data.
When your GPU is rasterizing the edges of polygons, it computes (sometimes just approximates) how much of a little square is covered by that polygon and uses that as the weight when averaging what color to assign to that pixel. The resulting rendered image is most correctly interpreted as an array of little squares, not point samples and definitely not truncated gaussians.
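Whether or not GPUs actually work this way is disputed below, but the coverage idea itself is easy to sketch: estimate, per pixel square, how much of it the shape covers and use that fraction as the blend weight. A toy Python approximation (supersampling stands in for an exact area computation; all names are mine):

    def coverage(inside, px, py, n=8):
        # Fraction of the unit pixel square with corner (px, py) that lies
        # inside the shape described by the predicate inside(x, y),
        # estimated with an n-by-n grid of sample points.
        hits = 0
        for j in range(n):
            for i in range(n):
                hits += inside(px + (i + 0.5) / n, py + (j + 0.5) / n)
        return hits / (n * n)

    # Example: the half-plane x <= 2.3 across a 5-pixel scanline.
    edge = lambda x, y: x <= 2.3
    print([coverage(edge, px, 0) for px in range(5)])
    # -> [1.0, 1.0, 0.25, 0.0, 0.0] with 8x8 samples; exact coverage of the
    #    third pixel is 0.3, and that partial value becomes its blend weight.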
Actually no, that's not the case, rasterization is on-off at the hardware level. You need anti-aliasing for the behaviour you are describing, which very rarely works the way you describe - the best we have right now as far as quality uses multisampling.
And that is not done by the GPU. The top comment started with "your GPU does...."
In any case, that is an extreme edge case of software renderers that doesn't even come close to a significant part of 2D graphics in real life. Indeed, most 2D graphics is really flat 3D graphics done using GPU routines and does not work that way. I know that some extreme edge cases do use coverage-based rasterization, but:
>You need anti-aliasing for the behaviour you are describing, which very rarely works the way you describe
This is a case of anti-aliasing (read the title of the article) and is extremely rarely used. It's essentially irrelevant when discussing how graphics work in real life.
I really cannot overstate just how rarely software rasterizers are used for interactive graphics in 2020, coverage based rasterizers are an even smaller subset of that. It really makes a ton more sense to use a GPU rasterizer and use MSAA or oversample the whole image.
2D graphics on the GPU is an open research problem. My understanding is that piet-gpu and Pathfinder, both state of the art research projects, use coverage-based solutions based on the GPU. MSAAx16, which is incredibly expensive on mobile devices, only provides 16 different coverage values, and from my limited tests was poorer quality than a coverage-based solution.
2D graphics on the GPU is not an open research problem in practice. In real life you either use Direct2D or OpenGL/Vulkan/Direct3D and just... ignore a dimension.
Yes, MSAA 16x is incredibly expensive on mobile devices, and it provides a worse result than a coverage-based approach. But MSAA 16x is done by an ASIC and is simpler than coverage-based AA; it is not even close in performance. A GPU ROP trounces any programmable compute unit, because it is specialized, dedicated silicon. And in practice MSAA 8x is more than good enough, especially on mobile devices. You certainly will not notice a difference on a phone with a density of 563 dpi between MSAA 4x and 8x, let alone 16x and coverage-based.
At those densities, the resolution of the phone is literally greater than the optical resolution of your eyes. There is no point in anything beyond MSAA 4x in reality, and a lot of people with displays in the 200 dpi range just use 2x MSAA when they could use 8x MSAA, because they really can't tell the difference.
The final nail in the coffin is that these compute-based rasterization engines so far more or less match the performance of CPU rasterization. This is simply unacceptable when GPU direct rasterization can give results nearly indistinguishable at multiple times the performance and much less power usage. This is literally taking something done by a highly optimized, 12-7nm ASIC, and trying to do it through compute for a tiny improvement. It's absurd.
Some rasterization algorithms are done this way, but they arguably are getting suboptimal results, and would do better to apply some other filter, instead of a box. (As pixels keep getting smaller and smaller it matters less though.)
> resulting rendered image is most correctly interpreted as an array of little squares
Still nope. What matters in the end is the viewer’s eyes/brain reconstruction of the image, and given the frequency response of human eyes to typical screens at typical viewing distances, there is little if any practical difference between convolving some eye-like reconstruction filter with pixels thought of as uniform-brightness squares vs. point samples.
If you want to improve your results you’ll get much more bang for your buck from considering RGB subpixels to be point samples offset the appropriate amounts for the given physical display than you’ll get from thinking of any of them as being an area light source instead of a point.
The data often "knows" where it will be displayed, and is designed with that knowledge. To go to the absurd extreme here, the data is just 1s and 0s, not pixels or samples or anything more.
Of course that's nonsense, because the data has context and the arrangement of those samples or pixels has a purpose.
Sometimes that purpose is to serve as a sampling of a real-world continuous image, other times it is to describe the arrangement and color of tiny little squares on an LCD screen.
Depending on what context you're working in pixels can be squares, or may not be.
A UI element will almost never be perfectly square and exactly aligned on a pixel. And even if it were, LCD pixels aren't squares anyway: they are three rectangles of different colors with gaps between them. Because of those gaps and sub-pixels, even a pixel-perfect square doesn't get a perfect representation on screen.
Most pixel art today is designed around square pixels. The pixels are typically blown up to some multiple for viewing - and in many cases, not even an integer multiple! This has resulted in the design of filter kernels that make some trade-off between the true sharpness of square pixels and the aliasing of nearest neighbor interpolation.
Therefore, even before images hit the display, there is a rationale for avoiding photographic resampling techniques: the way this content was authored defined its meaning differently.
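One widely used kernel of the kind described above is the "sharp bilinear" trick: interpolate, but keep the weight flat inside each source pixel and blend only within about one output pixel of its boundary. A 1-D sketch (my own illustration, not any particular emulator's filter):

    def resample_sharp(src, out_len, sharpness):
        # Resample a 1-D signal; sharpness=1 is plain bilinear, larger values
        # approach square pixels while still antialiasing the seams.
        out, scale = [], len(src) / out_len
        for i in range(out_len):
            x = (i + 0.5) * scale                      # source coord of output center
            x0 = min(max(int(x - 0.5), 0), len(src) - 1)
            x1 = min(x0 + 1, len(src) - 1)
            frac = x - (x0 + 0.5)
            # Sharpened interpolation weight: flat inside a source pixel,
            # a narrow linear ramp across its boundary.
            w = min(max((frac - 0.5) * sharpness + 0.5, 0.0), 1.0)
            out.append(src[x0] * (1 - w) + src[x1] * w)
        return out

    # A 4-pixel checker blown up to 11 pixels; sharpness = 11/4 makes the
    # blend region exactly one output pixel wide.
    print(resample_sharp([0, 1, 0, 1], 11, 11 / 4))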
See also: The gradual evolution of font rendering techniques. Earlier versions of Windows aimed for pixel-grid snapping to produce clean, sharp edges. Newer ones introduced anti-aliasing techniques, but again made trade-offs towards sharp edges, including exploitation of the LCD display format, while the contemporary Apple rendering favored the photographic approach. With the introduction of high DPI the differences have become less pronounced, but there's still disagreement about how best to render vector fonts into pixels.
Then how come a column of white pixels next to a column of black pixels looks perfectly sharp, as sharp as a black sheet of paper on a white background?
Whatever pixels "are", they are certainly tools for inducing a certain perception in the eye of the viewer, so we should go by how that works. We're used to this from color itself - no one denies that 0xffff00 is "yellow", despite the physical emission containing no yellow light, because red + green in certain proportions induces the same response in our eyes as light whose wavelength is actually yellow. So why can't we apply the term "squares" to things that our eyes see as squares even if they are not physically squares?
Open this in IrfanView, set the magnification to 100%. It looks about 75%-25% dark grey to light grey unless you really focus on it. This is exactly what you would expect if you conceptualize pixels as points.
So no, a column of white pixels next to a column of black pixels does not look perfectly sharp. If it did, the image above would look perfectly black and white, with no grey. And by the time I approach the pixels enough to see perfect black vs perfect white, I also notice the black gaps between pixels, and very soon the sub-pixels themselves!
This is also why you don't see individual sub-pixels, by the way.
Your eyes largely cannot distinguish the individual squares that pixels are if you are using a non-ancient screen. Your eye does not form an image of the square pixel. It largely loses them as they blend into each other to form an image, in which pixels are much, much closer to points than they are to squares.
Not 'perfectly' sharp, but if you have a bunch of objects where the edges go from 0 to 200, and you have a bunch of objects where the edges go from 0 to 100 to 200, there is a significant difference in sharpness on most screens.
Sure. And that is also perfectly consistent with considering pixels are dimensionless points forming an image, which they really are in practice much more than they are simply squares.
Indeed, an edge that goes from 0 to 100 can be considered as part of a wave of twice the frequency but the same amplitude as compared to an edge that goes from 0 to 200. Which is, by the way, why increasing the contrast in an image, especially micro-contrast, in practice increases resolution.
This is supposed to be an edge with nearly perfect sharpness.
If you take a single point sample, then slowly moving objects will appear to jump an entire pixel at a time, looking awful.
If you antialias, then movement will look smooth, but you'll also notice that when you align to the pixel squares the edge will preserve its sharpness better.
You have to be really careful when you're applying wave equations to resolution, especially when declaring that a certain number of samples fully captures an image.
If you want to display a perfect image with point samples, you may need to go as far as 10x the 'retina' density.
https://en.wikipedia.org/wiki/Hyperacuity_(scientific_term)
Take a sheet of white paper 10 inches wide and draw 480 evenly spaced vertical lines on that paper. Do you see the lines? Or do you see a gray sheet of paper?
>looks perfectly sharp, as sharp as a black sheet of paper
They don't. Pixels are so small that they start looking like (or are well into looking like) a sample that is used to form an image. If pixels "looked perfectly sharp, as sharp as a black sheet of paper", this wouldn't be the case.
As far as your eyes, the sharp lines might as well be blurry waves of an appropriate contrast. So in an image sense, the lines don't really exist anymore, all that exists is a blurry brightness function and not infinitely sharp lines.
I don't know about yours, but my mouse cursor is not a perfectly straight 2+ pixel wide solid-colored rectangle, so that doesn't really matter. As for windows edges, I'll give you that one.
However, the edge of my window happens not to be green, so it doesn't actually align with the elements of my LCD :) I just so happen not to notice it, because my eyes don't have individual pixel resolution, almost as if there was a low-pass filter of the order > 2 lambda... Food for thought!
That's true, but we're really getting there fast. My very sharp Minolta F1.4 full-frame lens is diffraction limited already at f/8 to f/11 at a resolution of 6000 x 4000. It's also much, much sharper than the vast majority of pictures people take.
While yes, 6000x4000 is a lot, monitors are already coming out with higher resolutions, so it's relevant right now. The fraction of images taken at an actual resolution so high that they have to be downscaled on a 6K or 8K monitor is, in practice, exceedingly small. Even with a Sony A7RIV, an insane camera, and mind-bogglingly sharp lenses, after the Bayer filter and taking sampling into account (which is very real in photography due to moiré), most of your pictures will not be at the level where you can create a truly sharp transition from one pixel to the next, whether because of depth of field, optical aberrations or motion blur.
So while it is often true, this is increasingly not the case.
Pixel-aligned sharp edges were super important on low-DPI displays, but they are increasingly less important. A one-pixel line on a phone screen is tiny, so most designs will use lines of more than one pixel. A pixel-perfect 3-pixel line doesn't look very different from an anti-aliased 3-pixel line with arbitrary floating-point coordinates on a phone.
Twitter? Because Twitter has been doing that a lot for me recently. It's like all their devs are on mobile or Retina screens and overpowered CPUs and nobody noticed the problem (and resulting processing overhead).
Worth noting that this does not apply to physical cameras - a pixel is not, in fact, a point sample, but the integral over a sub-region of the sensor plane. It's also not a complete integral - the red pixels in an image are interpolated from squares that only cover a quarter of the image plane (on 95+% of sensors). Then you bring in low pass filters (or don't), and the signal theory starts to get a bit complicated.
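A toy 1-D illustration of that difference, comparing point samples at pixel centers with a box integral over each pixel's footprint (numpy assumed; the scene and numbers are made up):

    import numpy as np

    def scene(x):
        # A continuous "scene": a fine grating near the sensor's Nyquist limit.
        return 0.5 + 0.5 * np.sin(2 * np.pi * 7.3 * x)

    pixels, sub = 16, 64
    centers = (np.arange(pixels) + 0.5) / pixels

    point = scene(centers)                                # point samples at pixel centers
    xs = (np.arange(pixels * sub) + 0.5) / (pixels * sub)
    area = scene(xs).reshape(pixels, sub).mean(axis=1)    # integral over each pixel

    # The area integral builds in a box low-pass, so fine detail comes out
    # attenuated compared to the point samples; both still alias, differently.
    print(np.round(point, 2))
    print(np.round(area, 2))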
It doesn't apply to screens either, as pixels are - manifestly - little squares. Your screen does not apply any sort of lovely reconstruction filter over this "array of point samples".
In short, it's wrong. You can model an image as an array of point samples - however these are not "pixels".
The memo interestingly talked about screens, and argued that they don't support the pixels-as-squares model because there are "overlapping shapes that serve as natural reconstruction filters"...
But it was in context of old CRT and Sony Trinitron monitors! I was wondering what it'd say about LCD screens but the memo is from 1995, and the first standalone LCDs only appeared in the mid-1990s and were expensive [1].
What it says about CRT electron beams no longer applies, but I'm guessing this still does:
> The value of a pixel is converted, for each primary color, to a voltage level. This stepped voltage is passed through electronics which, by its very nature, rounds off the edges of the level steps
> Your eye then integrates the light pattern from a group of triads into a color
> There are, as usual in imaging, overlapping shapes that serve as natural reconstruction filters
Your screen is the "lovely reconstruction filter over this 'array of point samples'".
This is, for LCDs, usually an array of little squares... sort of (probably more accurately described as an array of little rectangles of different color). Things get more complicated when you start talking about less traditional subpixel arrangements like PenTile, or the behavior of old CRTs (where you don't necessarily have fully discrete pixels at all).
It's a reconstruction of sorts, but arguably pretty much the worst possible reconstruction and pretty far from lovely, compared to e.g. the near-perfect reconstruction filters in audio.
I wonder if there have been any experiments constructing displays with optical filters to provide better reconstruction. I guess the visual analogue would be image upscaling, and in that sense the reconstruction that LCDs etc. provide would be comparable to nearest-neighbor scaling (which generally sucks).
It depends on the goals. If the goal is accurate frequency response and in particular absence of aliasing, then a little square is a bad reconstruction filter.
But for graphics, why should this be the goal? Perhaps a better goal is high contrast of edges, and for this the box filter is one of the very best. An additional advantage of the box filter is that it has only positive support, so there's no clipping beyond white and black. This is especially helpful when rendering text.
And honestly I believe that those huge sinc-approximating reconstruction filters are overly fetishized even in the audio space. The main reason they sound "nearly perfect" is that the cutoff is safely outside the audible range. Try filtering a perfect slow sawtooth through a brick wall with a cutoff in the audio range, say 8kHz. It sounds like a very audible "ping" at that frequency, with pre-echo at that because of the symmetry of sinc.
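That experiment is easy to reproduce (numpy assumed; the parameters are just for illustration):

    import numpy as np

    fs, f0, cutoff = 48_000, 100, 8_000     # sample rate, sawtooth freq, brick wall
    t = np.arange(fs) / fs                  # one second
    saw = 2 * (t * f0 % 1.0) - 1            # ideal, non-bandlimited sawtooth

    # Brick-wall lowpass: zero every frequency bin above the cutoff.
    spectrum = np.fft.rfft(saw)
    spectrum[np.fft.rfftfreq(len(saw), 1 / fs) > cutoff] = 0
    filtered = np.fft.irfft(spectrum, len(saw))

    # The result rings near 8 kHz around each discontinuity, and because the
    # ideal lowpass kernel (sinc) is symmetric in time, the ringing starts
    # *before* the jump -- the pre-echo mentioned above.
    jump = fs // f0                         # sample index of the first wrap
    print(filtered[jump - 5 : jump + 5].round(3))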
One problem with pixelation is that if you take your beautifully sharp-edged, pixel-aligned square and move it 1mm to the left, it will not be perfectly sharp-edged anymore. So it lacks a certain uniformity. Comparing again to audio, afaik (please correct me if I'm wrong!) you can phase shift/delay signals however you wish and it should not add distortion.
Afaik this would come up in text rendering where the glyphs and strokes inevitably will not align with the pixel grid, but you would know better about that.
> you can phase shift/delay signals however you wish and it should not add distortion.
Digital audio sampling rates and resolution are far higher than the limits of perception, so there's sufficient resolution to shift by eg. half a wavelength for noise cancelling or by fractions of a wavelength for beamforming (at least, if you use the highest sample rates and resolution supported by your equipment, rather than the CD audio standard default settings).
The spatial resolution of computer graphics does not yet have such comfortable headroom. In the best cases, they've caught up to normal human visual acuity but usually not vernier acuity. Once displays outpace eyeballs to the same extent that soundcards outpace ears, it will be possible to shift an image by 1mm and have it remain as sharp-looking as the original—because a sharp edge smeared across several pixels will still be sharp enough to look "perfectly sharp" to a human.
The de-Bayering filter in a camera is a lot more sophisticated than a simple interpolation of pixels of the same color. Most real world images have a huge amount of correlation between the R,G,B channels because few colors are perfectly saturated. Filtering software takes advantage of this and gets much higher resolution than simple interpolation ever could.
Many RAW developers give you a choice of which de-bayering algorithm to use. Some are optimized for maximum detail retention. You can easily see the difference on a black-and-white resolution test target, which has perfect correlation between the R,G,B values.
What do you mean? The only objectionable content I see is a single mention of "little-squarists" [quotation marks included in original text], but that seems hardly enough of an insult to disregard the rest of the paper.
The first half of the paper is spent building up to the name-calling; it starts with "little squares", reaching a crescendo with the revelation that it is not the squares that are little, but the people! The closest it gets to a comparison ends abruptly with "we will not even further consider this possibility". The second half of the article continues as if the first half had been convincing, just asserting that his solution is "so easy after all" that it must be right.
This is pedantic BS that has nothing to do with the issue in the article. The article talks about a very specific issue of offsets in practice cases. This is equivalent to somebody talking about simple circuits and then complaining that "but what about relativity".
I once made a persuasive argument that this was the proper interpretation of a pixel, and got a major app to adopt the convention. It wasn't until much later that I discovered the error.
The problem comes when you try to align raster images with vector ones. Instead of starting a line at 0,0 you need to start it at 0.5,0.5. And heaven help you if you're combining raster images at different resolutions, they'll never line up.
The proper way to work is to put the pixel center at 0,0 and let it extend from -0.5 to 0.5. This works out well with reconstruction or resizing filters, because they're symmetric around 0,0 too.
Pixel centers at (0.5, 0.5) is the convention adopted by all major graphics APIs (OpenGL, DirectX, Vulkan, Metal). That's because it makes the math, especially around image scaling, a lot simpler. What exactly is the alignment problem that you say isn't solvable?
One common mistake is to not think in floating point coordinates all the way through. E.g. a rectangle that covers a single pixel should have coordinates like (100.0,100.0)-(101.0,101.0), NOT involving a 0.5 offset. You almost never offset anything by 0.5 in this convention. 1 pixel wide lines are an exception, but only because then the edges of the line are exactly at pixel boundaries.
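A tiny illustration of that convention (the function names are mine):

    from math import floor

    def pixel_center(ix, iy):
        # Continuous coordinates of the center of the pixel at address (ix, iy).
        return ix + 0.5, iy + 0.5

    def pixel_rect(ix, iy):
        # The area that pixel covers: its corners land on whole numbers.
        return (float(ix), float(iy)), (float(ix + 1), float(iy + 1))

    def containing_pixel(x, y):
        # Inverse mapping: which pixel a continuous point falls in.
        return floor(x), floor(y)

    # The rectangle covering pixel (100, 100) is (100.0, 100.0)-(101.0, 101.0);
    # the only half-integers in sight are the pixel centers themselves.
    print(pixel_rect(100, 100), pixel_center(100, 100), containing_pixel(100.7, 100.2))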
I know that many languages have some sort of support for units. It would be nice to have libraries which explicitly say that (0,1) and (-1,1) are different, and support transforming between them. I think that this transformation comes up all the time when working with pixels that are properly aligned and centered.
I know of yt-project, which has a lot of cool support for units in the context of sci-vis. Support for transforms between coordinate systems is nice though. Would love to have that. The only hard part is that systems which try and do this sort of thing lose some of the elegance of saying "V = (0, 1)" when you also have to specify the coordinate system you're working under for every vector.
There have been some papers that do this though. I can't find the reference but I know it exists.
My understanding is that graphics are smeared on Unix systems, but sharp on Windows systems, because Unix libraries use 0.5, 0.5 as the center and Windows uses 0,0.
Can anybody confirm that they investigated that deeply and found that to be true, or is there another explanation?
I have never heard such a thing. Are you talking about font rendering differences? There are different anti-aliasing algorithms, but they don’t have anything to do with this convention.
I once made a half-pixel error trying to write a fast Fourier transform as a series of fragment shaders. That was nasty to track down! The error was smeared across all the pixels in the frequency domain causing weird subtle frequency biases in certain directions.
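For anyone wondering why a half-pixel error looks like that: by the Fourier shift theorem, sampling the same signal half a sample off multiplies every frequency bin by a linear phase ramp, so magnitudes are untouched while phases are skewed more at higher frequencies. A quick numpy sketch (numpy assumed):

    import numpy as np

    n = 8
    x = np.random.default_rng(0).standard_normal(n)

    X = np.fft.fft(x)
    ramp = np.exp(-1j * np.pi * np.arange(n) / n)   # shift theorem, shift = 0.5 sample
    X_half_shifted = X * ramp

    print(np.abs(X_half_shifted) - np.abs(X))   # ~0: magnitudes unchanged
    print(np.angle(ramp))                       # phase error grows linearly with frequency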
For instance, in GDAL there's a whole RFC for dealing with issues related to pixel corners versus pixel centers!
https://gdal.org/development/rfc/rfc33_gtiff_pixelispoint.ht...
"Traditionally GDAL has treated this flag as having no relevance to the georeferencing of the image despite disputes from a variety of other software developers and data producers. This was based on the authors interpretation of something said once by the GeoTIFF author. However, a recent review of section [section 2.5.2.2] of the GeoTIFF specificaiton has made it clear that GDAL behavior is incorrect and that PixelIsPoint georeferencing needs to be offset by a half a pixel when transformed to the GDAL georeferencing model."