2D Graphics on Modern GPU (2019) (raphlinus.github.io)
178 points by peter_d_sherman on March 15, 2021 | 87 comments



Hi again! This post was an early exploration into GPU rendering. The work continues as piet-gpu, and, while it's not yet a complete 2D renderer, there's good progress and an active open source community, including people from the Gio UI project. I recently implemented nested clipping (which can be generalized to blend modes) and have a half-finished blog post draft about it. This is also part of my work as a researcher on the Google Fonts team. Feel free to ask questions in this thread - I probably won't follow up on everything, as the discussion is pretty sprawling.


One small error (I think) - I noticed your link to pathfinder linked to someone's 2020 fork of the repository rather than the upstream servo repository.


> someone's 2020 fork

pcwalton was the developer behind pathfinder at Mozilla (but was part of last summer's layoffs).


The quality is not good, and the performance is not even mentioned.

I have used a different approach: https://github.com/Const-me/Vrmac#vector-graphics-engine

My version is cross-platform, tested with Direct3D 12 and GLES 3.1.

My version does not view the GPU as a SIMD CPU; it actually treats it as a GPU.

When rendering a square without anti-aliasing, the library will render 2 triangles. When rendering a filled square with anti-aliasing, the library will render about 10 triangles: a large opaque square in the center, and a thin border, about 1 pixel thick, around it for AA.

It uses the hardware Z buffer with early Z rejection to save pixel shader invocations and fill rate. It uses screen-space derivatives in the pixel shader for anti-aliasing. It renders arbitrarily complex 2D scenes with only two draw calls: one front to back with the opaque stuff, another back to front with the translucent stuff. It does not have quality issues with stroked lines much thinner than 1px.
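
For illustration, here's a rough sketch of the geometric-AA idea described above (my own illustration, not the actual Vrmac mesh builder): a filled, anti-aliased rectangle expands into an opaque core plus a roughly 1px fringe whose outer edge fades to zero alpha.

    struct Vertex {
        x: f32,
        y: f32,
        alpha: f32, // 1.0 on the inner contour, 0.0 on the outer contour
    }

    // Opaque core quad plus an AA fringe ring; `feather` is roughly one output pixel.
    fn feathered_rect(x0: f32, y0: f32, x1: f32, y1: f32, feather: f32) -> Vec<Vertex> {
        let h = feather * 0.5;
        // Inner contour (shrunk by half the feather) and outer contour (grown by half).
        let inner = [(x0 + h, y0 + h), (x1 - h, y0 + h), (x1 - h, y1 - h), (x0 + h, y1 - h)];
        let outer = [(x0 - h, y0 - h), (x1 + h, y0 - h), (x1 + h, y1 + h), (x0 - h, y1 + h)];
        let mut v = Vec::new();
        // Opaque core: 2 triangles of the inner quad.
        for &(x, y) in &[inner[0], inner[1], inner[2], inner[0], inner[2], inner[3]] {
            v.push(Vertex { x, y, alpha: 1.0 });
        }
        // Fringe: 2 triangles per edge, inner vertices opaque, outer vertices transparent.
        for i in 0..4 {
            let j = (i + 1) % 4;
            for &((x, y), alpha) in &[
                (inner[i], 1.0), (outer[i], 0.0), (outer[j], 0.0),
                (inner[i], 1.0), (outer[j], 0.0), (inner[j], 1.0),
            ] {
                v.push(Vertex { x, y, alpha });
            }
        }
        v // 2 core + 8 fringe = 10 triangles, matching the "about 10 triangles" above
    }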


> The quality is not good, and the performance is not even mentioned.

I notice that your renderer doesn't even attempt to render text on the GPU and instead just blits glyphs from an atlas texture rendered with Freetype on the CPU: https://github.com/Const-me/Vrmac/blob/master/Vrmac/Draw/Sha...

In contrast, piet-gpu (the subject of the original blog post) has high enough path rendering quality (and performance) to render glyphs purely on the GPU. This makes it clear you didn't even perform a cursory investigation of the project before making a comment to dump on it and promote your own library.


> instead just blits glyphs from an atlas texture rendered with Freetype on the CPU

Correct.

> has high enough path rendering quality (and performance) to render glyphs purely on the GPU

Do you have screenshots showing quality, and performance measurements showing speed? Ideally from a Raspberry Pi 4?

> This makes it clear you didn't even perform a cursory investigation of the project

I did, and mentioned it in the docs; here’s a quote: “I didn’t want to experiment with GPU-based splines. AFAIK the research is not there just yet.” Verifiable because version control: https://github.com/Const-me/Vrmac/blob/bbe83b9722dcb080f1aed...

For text, I think bitmaps are better than splines. I can see how splines are cool from a naïve programmer’s perspective, but practically speaking they are not good enough for the job.

Vector fonts are not resolution independent, because of hinting. Fonts include bytecode for compiled programs that do that. GPUs are massively parallel vector chips, not a good fit for interpreting the bytecode of a traditional programming language. This means whatever splines you upload to the GPU will only be valid for a single font size; trying to reuse them for a different resolution will cause artifacts.

Glyphs are small and contain lots of curves: lots of data to store, and lots of math to render, for a comparatively small count of output pixels. Copying bitmaps is very fast; modern GPUs, even low-power mobile and embedded ones, are designed to output a ridiculous volume of textured triangles per second. Font face and size are more or less consistent within a given document/page/screen. Apart from synthetic tests, glyphs are reused a lot, and there aren’t too many of them.

When I started the project, the very first compute shader support for the Pi 4 had only just landed in the upstream Mesa repo. It was not yet in the official OS images. Bugs are very likely in version 1.0 of anything at all.

Finally, even if the Pi 4 had had awesome support for compute shaders back then, the raw compute power of its GPU is not that impressive. Here on my Windows PC, my GPU is 30 times faster than the CPU in terms of raw FP32 performance. With that kind of performance gap, you can probably make GPU splines work fast enough after spending enough time on development. Meanwhile, on the Pi 4 there’s no such gap; the quad-core CPU has raw performance pretty close to that of the GPU. To a lesser extent the same applies to low-end PCs: I only have a fast GPU because I’m a graphics programmer. Many people are happy with their Intel UHD graphics, and those are not necessarily faster than their CPUs.


> > This makes it clear you didn't even perform a cursory investigation of the project

> I did, and mentioned in the docs, here’s a quote: “I didn’t want to experiment with GPU-based splines. AFAIK the research is not there just yet.”

Not what I said. I said that you didn't investigate the project discussed in the original blog post before declaring, in your words, that "the quality is not good" and comparing it to your own library.

Vrmacs and piet-gpu are two totally different types of renderer. Vrmacs draws paths by decomposing them into triangles, rendering them with the GPU rasterizer, and antialiasing edges using screen-space derivatives in the fragment shader. This approach works great for large paths, or paths without too much detail per pixel, but it isn't really able to render small text, or paths with a lot of detail per pixel, with the necessary quality. (Given this and the other factors you mentioned in your reply, rendering text on the CPU with Freetype is a perfectly reasonable engineering choice and I am not criticizing it in the slightest.)

In comparison, piet-gpu decomposes paths into line segments, clips them to pixel boundaries, and analytically computes pixel coverage values using the shoelace formula/Green's theorem, all in compute shaders. This is more similar to what Freetype itself does, and it is perfectly capable of rendering high-quality small text on the GPU, in a way that Vrmacs isn't without shelling out to Freetype.
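
For anyone unfamiliar with the shoelace/Green's theorem part: the signed area enclosed by a closed polyline decomposes into independent per-edge terms, and the same identity applied to segments clipped to a single pixel square yields that pixel's exact coverage. A minimal sketch of just the area identity (my illustration, not piet-gpu code):

    // Signed area of a closed polygon via the shoelace formula.
    // Each edge contributes 0.5 * (x0*y1 - x1*y0) independently of the others,
    // which is what makes the per-pixel accumulation parallel-friendly.
    fn signed_area(poly: &[(f32, f32)]) -> f32 {
        let mut sum = 0.0;
        for i in 0..poly.len() {
            let (x0, y0) = poly[i];
            let (x1, y1) = poly[(i + 1) % poly.len()];
            sum += 0.5 * (x0 * y1 - x1 * y0);
        }
        sum
    }

    // A counter-clockwise unit square has signed area 1.0, i.e. full pixel coverage:
    // signed_area(&[(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]) == 1.0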

Again, to be clear, I'm not criticizing any of the design choices that went into Vrmacs; it looks like it occupies a sweet spot similar to NanoVG or Dear ImGui, where it can take good advantage of the GPU for performance while still being simple and portable. My only point here is that you performed insufficient investigation of piet-gpu before confidently making an uninformed claim about it and putting it in a somewhat nonsensical comparison with your own project.


> in your words, that "the quality is not good"

Oh, you were asking why I said so? Because I clicked the “notes document” link in the article; the OP used the same tiger test image as me, and that document has a couple of screenshots. These were the only screenshots I found. Compare them to screenshots of the same vector image rendered by my library, and you’ll see why I commented on the quality.

> Vrmacs draws paths by decomposing them into triangles, rendering them with the GPU rasterizer, and antialiasing edges using screen-space derivatives in the fragment shader.

More or less, but (a) not always; thin lines are handled differently, and (b) that’s a high-level overview, and there are many important details at the lower levels. For instance, “screen-space derivatives of what?” is an interesting question, critically important for correct and uniform stroke widths. The meshes I’m building are rotation-agnostic, and to some extent (but not completely) they are resolution-agnostic too.

> and it is perfectly capable of rendering high-quality small text on the GPU

It is, but the performance overhead is massive compared to the GPU rasterizer rendering these triangles. For real-world vector graphics that doesn’t have too much stuff per pixel, that complexity is not needed, because triangle meshes are already good enough.

> it looks like it occupies a sweet spot similar to NanoVG

There are similarities; I have copy-pasted a few text-related things from my fork of NanoVG: https://github.com/Const-me/nanovg/ However, Vrmac delivers much higher quality for 2D vector graphics (VAA, circular arcs, thin strokes, etc.), is much faster (meshes are typically reused across frames, I use more than one CPU core, and the performance-critical pieces are in C++, manually vectorized with NEON or SSE), and is more compatible (GL support on Windows or OSX is not good; you want D3D or Metal respectively).


The document explains above the tiger image (like, directly above it) that it is a test image meant to evaluate a hypothesis about fragment shader scheduling:

> Update (7 May): I did a test to see which threads in the fragment shader get scheduled to the same SIMD group, and there’s not enough coherence to make this workable. In the image below, all pixels are replaced by their mean in the SIMD group (active thread mask + simd_sum)

I cloned the piet-gpu repository and was able to render a very nice image of the Ghostscript tiger: https://imgur.com/a/swyW0gl


Way better than in the article, but still, I like my results better.

The problematic elements are thin black lines. In your image the lines are aliased, which is visible for the lines that are close to horizontal but not quite. And for curved thin lines, it results in visually non-uniform thickness along the line.


The original piet-metal codebase has a tweak where very thin lines are adjusted to thicker lines with a smaller alpha value, which improves quality there. This has not yet been applied to piet-gpu.
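
In pseudocode terms, that tweak is roughly the following (my sketch of the idea described above, not the actual piet-metal code):

    // Strokes narrower than one pixel are widened to one pixel and their alpha is
    // scaled down proportionally, so total coverage stays roughly the same and the
    // line doesn't alias or break up into dashes.
    fn adjust_thin_stroke(width_px: f32, alpha: f32) -> (f32, f32) {
        if width_px < 1.0 {
            (1.0, alpha * width_px)
        } else {
            (width_px, alpha)
        }
    }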

One of the stated research goals [1] of piet-gpu is to push quality beyond what is expected of renderers today, addressing conflation artifacts, resampling filters, careful adjustment of gamma, and other things. I admit the current codebase is not there yet, but I am excited about the possibilities in the approach, much more so than pushing the rasterization pipeline as you are doing.

[1]: https://github.com/linebender/piet-gpu/blob/master/doc/visio...


I have doubts. The reason rasterizers are so good by now is that games have been pushing fill rate, triangle counts, and texture sampler performance and quality for more than a decade.

Looking forward, I’d rather expect practical 2D renderers to use the tech made for modern games: mesh shaders, raytracing, deep learning, and even smaller features like sparse textures. These are the areas where hardware vendors are putting their transistors and research budgets.

None of the features you mentioned are impossible with rasterizers. Hardware MSAA mostly takes care of conflation artifacts, and gamma is doable with higher-precision render targets (e.g. Windows requires FP32 support since D3D 11.0).


Does your approach allow for quality optimizations like subpixel rendering for arbitrary curves? It seems like this is what is interesting about this approach.

Also in terms of “two draw calls”, does that include drawing text as part of your transparent pass, or are you assuming that your window contents are already rendered to textures?


> Does your approach allow for quality optimizations like sub pixel rendering for arbitrary curves?

The library only does grayscale AA for vector graphics.

Subpixel rendering is implemented for text but comes with limitations: it only works when the text is not transformed (or is transformed in specific ways, like rotated 180°, or horizontally flipped), and you need to pass the background color behind the text to the API. It will only look good if the text is on a solid color background, or on a slowly changing gradient.

Subpixel AA is hard for arbitrary backgrounds. I’m not sure many GPUs support the required blend states, and the workarounds are slow.
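
The blend-state problem comes from subpixel coverage being a per-channel value rather than a single alpha. A rough CPU-side sketch of the math (illustrative names, not the Vrmac API):

    // Per-channel (LCD) blend: each of R, G, B has its own coverage, so the blend is
    // three separate lerps against the background, which fixed-function alpha blending
    // can't express without dual-source blending or knowing the background color.
    fn blend_subpixel(coverage: [f32; 3], text: [f32; 3], background: [f32; 3]) -> [f32; 3] {
        let mut out = [0.0f32; 3];
        for c in 0..3 {
            out[c] = background[c] * (1.0 - coverage[c]) + text[c] * coverage[c];
        }
        out
    }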

> does that include drawing text as part of your transparent pass

Yes, that includes text and bitmaps. Here are the two HLSL shaders that do that; the GPU abstraction library I’ve picked recompiles the HLSL into GLSL or other languages on the fly: https://github.com/Const-me/Vrmac/blob/master/Vrmac/Draw/Sha... https://github.com/Const-me/Vrmac/blob/master/Vrmac/Draw/Sha... These shaders are compiled multiple times with different sets of preprocessor macros, but I did test with all of them enabled at once.


Dumb question ... but isn't the easiest AA method rendering at a higher resolution and downsampling the result?

I see that it's not feasible for a lot of complex 3D graphics, but 2D is (probably) a lot less taxing for modern GPUs?


piet-gpu contributor here. You're right that supersampling is the easiest way to achieve AA. However, its scalability issues are immense: for 2D rendering it's typically recommended to use 32x for decent-quality AA, but as the cost of supersampling scales linearly (actually superlinearly due to memory/register pressure), it becomes more than an order of magnitude slower than the baseline. So if you want to do anything that is real-time (e.g. smooth page zoom without resorting to prerendered textures, which become blurry), supersampling is mostly an unacceptable choice.

What is more practical is some form of adaptive supersampling: a lot of pixels are filled by only one path and don't require supersampling. There are also more heuristics that can be used. One that I want to try out in piet-gpu is to exploit the fact that in 2D graphics, most pixels are covered by at most two paths. So as a baseline we can track only two values per pixel plus a coverage mask, then in the rare cases where three or more shapes overlap, fall back to full supersampling. This should keep the cost amplification under control.
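
A rough sketch of that bookkeeping (an idea being considered, not shipped piet-gpu code; the sample count and layout here are purely illustrative):

    const SAMPLES: usize = 16; // e.g. a 4x4 subsample grid per pixel

    // Per-pixel state for the "two paths plus coverage mask" heuristic.
    enum PixelState {
        // Common case: at most two paths touch the pixel; store a path id and a
        // bitmask of which subsamples each path covers.
        Slots { entries: [(u32, u16); 2], count: u8 },
        // Rare case: three or more overlapping paths; fall back to storing one
        // RGBA value per subsample (full supersampling for this pixel only).
        Supersampled(Box<[[f32; 4]; SAMPLES]>),
    }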


The short answer — it depends.

What you described is called supersampling. Supersampling is indeed not terribly hard to implement; the problem is the performance overhead. Many parts of the graphics pipeline scale linearly with the pixel count. If you render at a 16x16 upscaled resolution, that results in 256x more pixel shader invocations, and 256x the fill rate.

There’s a good middle ground called MSAA: https://en.wikipedia.org/wiki/Multisample_anti-aliasing In practice, 16x MSAA often delivers very good results for both 3D and 2D. In the case of 2D, even low-end PC GPUs are fast enough at the 8x or 16x MSAA level.

Initial versions of my library used that method.

The problem was, the Raspberry Pi 4 GPU is way slower than PC GPUs. The performance with even 4x or 2x MSAA was too low at 1920x1080 resolution, even just for 2D. Maybe the problem is the actual hardware, maybe it’s a performance bug in the Linux kernel or GPU drivers; I have no idea. I didn’t want to mess with the kernel; I wanted a library that works fast on the officially supported 32-bit Debian Linux. That’s why I bothered to implement my own method for antialiasing.


> Maybe the problem is actual hardware

I think it is - as far as I know most modern GPUs implement MSAA at the hardware level, and that's why even a mobile GPU can handle 8x MSAA at 1080p.

I don't know anything about the Raspberry Pi GPU, but maybe you'd have better results switching to FXAA or SMAA there (by which I mean faster, not visually better).


> maybe you'd have better results switching to FXAA or SMAA there

I’ve thought about that, but decided I want to prioritize the quality. 16x MSAA was almost perfect in my tests, 8x MSAA was still good. With levels lower than that, the output quality was noticeably worse than Direct2D, which was my baseline.

And another thing: I have found that stroked lines much thinner than 1px after transforms need special handling, otherwise they don’t look good: too aliased, too thick, and/or causing temporal artifacts while zooming continuously. Some real-world vector art, like that Ghostscript tiger from Wikipedia, has quite a lot of such lines.


High quality supersampling is about equally hard in 2D and 3D, since the result is 2D. It is also the most common solution in both 2D and 3D graphics, so your instinct is reasonably good.

But, font and path rendering, for example in web browsers and/or with PDF or SVG - these things can benefit in both efficiency and in quality from using analytic antialiasing methods. 2D vector rendering is a place where doing something harder has real payoffs.

Just a fun aside - not all supersampling algorithms are equally good. If you use the wrong filter, it can be very surprising to discover that there are ways you can take a million samples per pixel or more and never succeed in getting rid of aliasing artifacts. (An example is if you just average the samples, aka use a Box Filter.) I have a 2D digital art project to render mathematical functions that can have arbitrarily high frequencies. I spent money making large format prints of them, so I care a lot about getting rid of aliasing problems. I've ended up with a Gaussian filter, which is a tad blurrier than experts tend to like, because everything else ends up giving me visible aliasing somewhere.
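
For the curious, the difference is just in how the samples are weighted before they're summed. A minimal sketch of Gaussian-weighted downsampling (my illustration; the sigma and the sampling pattern are whatever your renderer provides, not the author's exact setup):

    // Weight each subsample by a Gaussian centered on the pixel, instead of the
    // plain average a box filter would use.
    fn gaussian_weight(dx: f32, dy: f32, sigma: f32) -> f32 {
        (-(dx * dx + dy * dy) / (2.0 * sigma * sigma)).exp()
    }

    // `samples` holds (dx, dy, value) with offsets measured from the pixel center.
    fn filter_pixel(samples: &[(f32, f32, f32)], sigma: f32) -> f32 {
        let (mut num, mut den) = (0.0, 0.0);
        for &(dx, dy, v) in samples {
            let w = gaussian_weight(dx, dy, sigma);
            num += w * v;
            den += w;
        }
        num / den // normalize so a constant input stays constant
    }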


Aside to the aside - you might find that applying a sharpening filter after the Gaussian gives a good result, as it won't reintroduce every kind of aliasing, and can reduce the appearance of blurriness.


If you render two anti-aliased boxes next to each other (i.e. they share an edge), will you see a thin line there, or a solid fill? Last time I checked, cairo-based PDF readers get this wrong, for example.


Good question.

If you use the fillRectangle method https://github.com/Const-me/Vrmac/blob/1.2/Vrmac/Draw/iDrawC... to draw them, I think you should get a solid fill. That particular API doesn’t use AA for that shape. Modern GPU hardware, with its rasterization rules https://docs.microsoft.com/en-us/windows/win32/direct3d11/d3..., is good at that use case; otherwise 3D meshes would contain holes between triangles.

If you render them as 2 distinct paths, filled, not stroked, and anti-aliased, you will indeed see a thin line between them. Currently, my AA method shrinks filled paths by about 0.5 pixels. For stroked paths it’s the opposite, BTW: the output is inflated by half of the stroke width (the midpoints of the strokes correspond to the source geometry).

You can merge the boxes into a single path with 2 figures https://github.com/Const-me/Vrmac/blob/1.2/Vrmac/Draw/Path/P...; in this case the C++ code of the library should collapse the redundant inner edge, and the output will be identical to a single box, i.e. a solid fill. It will also render slightly faster because there are fewer triangles in the mesh.


Something else worth looking at: the slug font renderer[0]. Sadly it's patented, but the paper[1] is there for those of you in the EU.

0. http://sluglibrary.com/

1. http://jcgt.org/published/0006/02/02/paper.pdf


In 2005 Loop & Blinn [0] found a method to decide if a sample / pixel is inside or outside a bezier curve (independently of other samples, thus possible in a fragment shader) using only a few multiplications and one subtraction per sample.

    - Integral quadratic curve: One multiplication
    - Rational quadratic curve: Two multiplications
    - Integral cubic curve: Three multiplications
    - Rational cubic curve: Four multiplications

[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...
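
For a sense of scale, the per-fragment test for the integral quadratic case really is that small. Per the paper's construction, the three control points get the canonical texture coordinates (0,0), (1/2,0), (1,1); the hardware interpolates (u,v), and the sign of u*u - v says which side of the curve the fragment is on. Sketched here as plain code rather than a shader (my illustration):

    // Loop-Blinn inside/outside test for an integral quadratic Bezier segment,
    // given the interpolated canonical texture coordinates (u, v):
    // one multiplication and one subtraction per sample.
    fn inside_quadratic(u: f32, v: f32) -> bool {
        u * u - v < 0.0
    }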


It's referenced in Slug's algorithm description paper [1]. The main disadvantage of Loop-Blinn is the triangulation step that is required, and at small text sizes you lose a bit of performance; Slug only needs to render a quad for each glyph. That is not to say that any one method is better than the other, though! They both have advantages and disadvantages. I think the two most advanced techniques for rendering vector graphics on the GPU are "Massively Parallel Vector Graphics" [2] and "Efficient GPU Path Rendering Using Scanline Rasterization" [3], though I don't know of any well-known usage of them. Maybe it's because they are very hard to implement; the sources attached to them are not trivial to understand, even if you've read the papers. They also use OpenCL/CUDA if I remember correctly.

[1] "GPU-Centered Font Rendering Directly from Glyph Outlines" http://jcgt.org/published/0006/02/02/

[2] http://w3.impa.br/~diego/projects/GanEtAl14/

[3] http://kunzhou.net/zjugaps/pathrendering/

EDIT: I've only now seen that [2] and [3] are already mentioned in the article

EDIT2: To compensate for my ignorance, I will add that one of the authors of MPVG has a course on rendering vector graphics: http://w3.impa.br/~diego/teaching/vg/


If I understand correctly, the second link is basically an extension of Loop-Blinn's implicit curve approach with vector textures, in order to find the winding count for each fragment in one pass.

>> Slug only needs to render a quad for each glyph.

I don't know how many glyphs you want to render (to the point that there are so many that you can't read them anymore), but modern GPUs are heavily optimized for triangle throughput, so 2 or 20 triangles per glyph makes only a little difference. The bigger problem is usually the sample fill rate and memory bandwidth (especially if you have to write to pixels more than once).

I have been eyeing the scanline-intersection-sort approach (your third link) too. Sadly they have no answer to path stroking (same as everybody else), and it also requires an efficient sorting algorithm for the GPU (implementations of which are hard to come by outside of CUDA, as you mentioned).


Indeed, most techniques that target the GPU have no answer to stroking; they recommend generating the outline paths beforehand so that the result looks stroked.

And yes, the number of triangles doesn't really make a difference in general, but in Slug's paper they say:

"At small font sizes, these triangles can become very tiny and decrease thread group occupancy on the GPU, reducing performance"

I'm not experienced enough to say how true that is/how much of a difference it makes.

> If I understand correctly the second link is basically an extension of Loop-Blinns implicit curve approach with vector textures in order to find the winding counter for each fragment in one pass.

I've read the paper, but to be honest it's a bit over my head right now. AFAIK MPVG is an extension of this [1], which itself looks like an extension of Loop-Blinn, so I think you're right.

[1] "Random-Access Rendering of General Vector Graphics" http://hhoppe.com/ravg.pdf


Any alternative solutions for the problem of GPU text rendering (that are not patent infringing)?


A signed distance field approach can be good depending on what you're after. https://github.com/libgdx/libgdx/wiki/Distance-field-fonts


There's a great WebGL library for doing that on the web using any .ttf, .otf, or .woff font - https://github.com/protectwise/troika/tree/master/packages/t...


You can always render the text to a texture offline as a signed distance field and just draw out quads as needed at render time. This will always be faster than drawing from the curves, and rendering from an SDF (especially multi-channel variants) scales surprisingly well if you choose the texture/glyph size well.

A little more info:

https://blog.mapbox.com/drawing-text-with-signed-distance-fi...

MIT-licensed open-source multi-channel glyph generation:

https://github.com/Chlumsky/msdfgen

The only remaining issue would be the kerning/layout, which is admittedly far from simple.
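
For context, the draw-time side of the SDF approach is only a few lines of shader math: sample the distance field and map distances around the 0.5 iso-contour to an alpha ramp roughly one screen pixel wide. Sketched here outside a shader, with illustrative names (not any particular library's API):

    // Convert an SDF texture sample into coverage. The stored value is in [0, 1]
    // with 0.5 on the glyph outline; `screen_px_range` says how many screen pixels
    // one unit of stored distance spans at the current glyph scale.
    fn sdf_coverage(distance_sample: f32, screen_px_range: f32) -> f32 {
        let d = (distance_sample - 0.5) * screen_px_range;
        (d + 0.5).clamp(0.0, 1.0) // ~1px smooth edge; a smoothstep is also common
    }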


FOSS does not magically circumvent patents.


Is there a serious risk of patent enforcement in common open source repositories ranging from GitHub to PPAs and Linux package repositories located outside any relevant jurisdictions?


Does that imply it's possible to implement 2D font/vector graphics rendering on a GPU and end up getting burned by patent law? I am having a hard time imagining they were awarded such a generic patent.

Anyway, I will adjust my question based on your feedback.


Slug isn't great for lots of shapes since it does the winding order scanning per-pixel on the pixel shader. It does have a novel quadratic root-finder. Put simply, it's better suited to fonts than large vector graphics.


I once implemented the basic idea behind the algorithm used in Slug (described in the paper [1]), though without the 'band' optimization; I just wanted to see how it works. And I agree with you, the real innovation is in that quadratic root-finder. It can tell you whether you are inside or outside just by manipulating the three control points of a curve, and it's very fast; what remains to be done is to use an acceleration data structure so that you don't have to check every curve. That works very well for quadratic Bézier curves; the paper says it can easily be extended to cubics, though no example is provided (and I doubt it's trivial). What I think would be hard with Slug's method is extending it to draw gradients, shadows, and basically general vector graphics, like you say. Eric Lengyel showed a demo on his Twitter [2] using Slug to render general vector graphics; I'm not sure how many features it supports, but it definitely supports cubic Bézier curves. I'd also like to add that the algorithm didn't impress me with how the text looks at small sizes, which I think is very important in general, though maybe not so much for games (maybe I just didn't implement it correctly).

[1] "GPU-Centered Font Rendering Directly from Glyph Outlines" http://jcgt.org/published/0006/02/02/

[2] https://twitter.com/EricLengyel/status/1190045334791057408


This would have been a lot better with examples in the form of rendered images or perhaps even a video. Maybe it's just my lack of background in graphics, but I had a lot of trouble grasping what the author was attempting to communicate without a concrete example.


It's about taking known 2D graphics and UI approaches which were developed for CPUs and looking at effective rendering engine architectures doing the same using GPUs. Terms such as "scene graph", "retained mode UI", etc. come from that existing 2D graphics material.

So the approach, AFAIU, is a data layout for the scene graph that is basically the more domain-general concern of mapping graph data structures, e.g. linked lists (which are CPU friendly), to array forms (GPU friendly) suitable for parallel treatment. There are other GPU concerns as well, such as minimizing global memory traffic by local caching, and mapping thread groups to tiles. I found the idea of having the scene graph resident on the GPU to be interesting.

(note to author: "serialization" comes from the networking roots of serializing a data structure for transmission over the net. So, definitely serial. /g)


I understand where you are coming from. There is a lot of jargon and it assumes familiarity with many concepts. I think any explanatory images which would help someone unfamiliar with the field would need to be accompanied by quite a bit of explanation.

One thing which I think made reading this a bit more work than necessary is that it feels like it's prattling on about a lot of tangential details and never quite gets to the point.

edit: a prime example of an unnecessary aside is mentioning the warp/wavefront/subgroup terminology. I feel anyone in the target audience should know this already and it's not really relevant to what's being explained.


It doesn't seem to be a finished work. I guess this is more of a journal entry on the author's initial takeaways from a week-long coding retreat.


It wouldn’t have been better, it just would have been more high level and generalized, but I don’t think that’s what the author was going for. I found the amount of detail refreshing, and as someone about to make a GPU based 2D display tree renderer, it was written at just the right level to be quite useful.


It would be fantastic if something like this were part of the modern APIs: Vulkan, Metal, DX12. But I guess it's not as sexy as raytracing.


I think the world is going into a more layered approach, where it's the job of the driver API (Vulkan etc) to expose the power of the hardware fairly directly, and it's the job of higher levels to express the actual rendering in terms of those lower level primitives. Raytracing is an exception because you do need hardware support to do it well.

Whether there's hardware that can make 2D rendering work better is an intriguing question. The NV path rendering stuff (mentioned elsethread) was an attempt (though I think it may be more driver/API than hardware), but I believe my research direction is better, in that it will be higher quality, faster, and more flexible with respect to the imaging model on standard compute shaders than an approach using the NV path rendering primitives. Obviously I have to back that up with empirical measurements, which is not yet done :)


Nvidia tried to make it happen[0].

Sadly, it didn't catch on.

0. https://developer.nvidia.com/nv-path-rendering


It's still there on all NVIDIA GPUs as an extension, just nobody uses it.

IMO it didn't catch on because all three of these points:

1. It only works on NVIDIA GPUs, and is riddled with joint patents from NVIDIA and Microsoft forbidding anyone like AMD or Intel from supporting it.

2. It's hard to use: you need to get your data into a format it can consume, usage is non-trivial, and often video game artists are already working with rasterized textures anyway so it's easy to omit.

3. Vector graphics suck for artists. The number of graphic designers I have met (who are the most likely subset of artists to work with vector graphics) that simply hate or do not understand the cubic bezier curve control points in Adobe, Inkscape, and other tools is incredible.


> It only works on NVIDIA GPUs, and is riddled with joint patents from NVIDIA and Microsoft forbidding anyone like AMD or Intel from supporting it.

Why do companies do this? What do they expect to get out of creating an API that is proprietary to their specific non-monopoly device, and that therefore very obviously nobody will ever actually use?


> What do they expect to get out of ...

Nvidia did exactly this with CUDA, so go take a look at the ML world for an example of how it works. It seems to be going quite well for them. A common enough refrain is "I don't really want to buy Nvidia, but my toolchain requires CUDA".

Pretty much every FPGA and SoC vendor does exactly this as far as I understand things. It's why you can't trivially run an open OS on most mobile hardware.

Apparently such schemes don't meaningfully affect the purchasing decisions of a sufficiently large fraction of people to disincentivize the behavior.


Intel and AMD also did the exact same thing with x86.

And yet, people still use it. (Arm also has its fair share of patents, but is an architecture that others can license.)


> What do they expect to get out of creating an API that is proprietary to their specific non-monopoly device, and that therefore very obviously nobody will ever actually use?

You mean like Cuda, which is wildly successful, has a huge ecosystem, and which basically ensures you'll have to buy NVidia if you're serious about GPU computing?


CUDA is, essentially, a B2B product-feature.

A business is flexible when serving its own needs: if they want to do GPGPU computations, they can evaluate the alternatives and choose a platform/technology like CUDA to build on; and so choose to lock themselves into the particular GPUs that implement that technology. But that’s a choice they’re only making for themselves. They’re using those GPUs to compute something; they’re not forcing anyone else downstream of them to use those same GPUs to consume what they produce. The GPUs they’re using become part of their black box.

Path shaders, on the other hand, would—at least as described in the article—be chiefly a B2C / end-user product-feature. They’d be something games/apps would rely on to draw 2D UI elements. But path shaders would need to be implemented† by the developers creating those games/apps.

† (Yes, really, they would need to be implemented by the game/app developer, not game-engine framework devs. Path shaders are a low-level optimization, like Vulkan — something that is only advantageous to use when you use it directly to take advantage of its low-level semantics, rather than through a framework that tries to wrap it in some other, more traditional semantics. As with OpenGL, the ill-fitted abstractions of the traditional API were precisely what made it slow!)

And those game/app developers, unlike the developers doing GPGPU computations, have to make a decision based on what devices their customers have, rather than what their own company can buy. (Or, if they think they have a “killer app”, they can try to force their customers to buy into the platform they built on. But most companies who think they’ve built a B2C killer app, haven’t.)

For a B2B product-feature, a plurality position is fine. Even one forced to be permanent by patents.

For a B2C product-feature, a plurality position by itself is fine, because rivals will just ship something similar and offer cross-support. But an indefinite plurality position is death.

Compare, in the web ecosystem:

• Features shipped originally in one renderer as “clean, standalone” ideas, e.g. -webkit-border-radius. These were cloned (-moz-border-radius, etc.) and then standardized (plain old border-radius.)

• Features shipped in one renderer as implementations dependent on that renderer’s particular environment/ecosystem, e.g. IE’s ActiveX-based CSS “filter” stanza. Because of how they were implemented, these had no potential to ever be copied by rivals, so nobody outside of Microsoft bothered to use them on their sites, since they knew that only IE would render the results correctly, and IE at that point already had only a plurality position.


Well, you could always implement a fallback for unsupported devices and advertise better performance on supported devices. Implementation will usually be "sponsored" by the one providing the technology. If the feature then gets adopted widely enough, others will have to come up with something similar or support it natively, too. This happens all the time with GPU and CPU vendors, both in gaming and in industry.


Maybe, but I got the sense from the article that what NVIDIA has patented are the semantics of the API itself (i.e. the types of messages sent between the CPU and GPU.) A polyfill for the API might still be infringing, given that it would need to expose the same datatypes on the CPU side.

And even if it wasn’t, the path-shader semantics are so different from those of regular 2D UI frameworks that a 2D UI framework implemented to use path-shading, falling back to the polyfill, might perform much worse than one implemented using regular CPU-side 2D plotting. It would very likely also suffer from issues like flicker, judder, etc., which are much worse/more noticeable than just “increased CPU usage”.


Intel and NVIDIA signed a patent cross-licensing agreement at the time, which is still ongoing for all patents filed on or prior to March 31, 2017.

https://www.sec.gov/Archives/edgar/data/1045810/000119312511...

Patents here are used for mutually assured destruction in case one tries to sue the other; NVIDIA won't use them against AMD (and the reverse is true too) where GPUs are concerned.

Intel and AMD would however enforce their patents if NVIDIA tries to enter the x86 CPU market.


> Vector graphics suck for artists

Someone mentioned Flash in this thread and that was a very approachable vector graphics tool. I don't know how many games translate to the vector style though - it's almost invariably a cartoonish look. The tools are very geometric so it just kind of nudges you towards that. Pixels these days are more like painting so it's no surprise artists like that workflow (they all secretly want to be painting painters).


> they all secretly want to be painting painters

This is going to be literally life changing for them! Quick, someone inform the artists of that!


> just nobody uses it

AFAIK Skia and derived works (Chrome, Chromium, Electron, etc.) all use it when available.


Stencil-and-cover approaches like NV_path_rendering have a lot of drawbacks, as documented below, but probably the biggest of all is that they're still mostly doing the tessellation on the CPU. A lot of the important things, like winding mode calculations, are handled on the CPU. Modern research is looking for ways out of that.


Actually, they calculate the winding per fragment on the GPU [0]. They require polygon tessellation only for stroking (which has no winding rule). The downside of their approach is that it is memory bandwidth limited, precisely because it does the winding on the GPU instead of using CPU tessellation to avoid overlap / overdraw.

Curve filling is pretty much solved with implicit curves, stencil-and-cover or scanline-intersection-sort approaches (all of which can be done in a GPU only fashion). Stroking is where things could still improve a lot as it is almost always approximated by polygons.

[0]: https://developer.nvidia.com/gpu-accelerated-path-rendering


Loosely related: Blend2D has been innovating a lot in this space.

https://blend2d.com/


Always appreciate Raph's work on rendering and UI programming, but I want to ask a question somewhat unrelated to this post: does anyone have a lot of experience doing 2D graphics on the CPU? I wonder if there'll be a day when we're confident doing all 2D stuff on the CPU, since CPUs are much easier to work with and give much more control. I also read that some old 3D games used software rendering and did well on old hardware, which gave me a lot of confidence in software rendering every (lowrez) thing.


Yes? We know how to write scanline renderers for 2D graphics. They're not that hard; a simple one can be done in ~100 lines of code or so. See my article here: https://magcius.github.io/xplain/article/rast1.html
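
To give a flavor of how small the core can be, here's a heavily stripped-down sketch (even-odd fill rule, pixel-center sampling, no antialiasing; my illustration, while the linked article builds up the real thing step by step):

    // Fill a single closed polygon into an 8-bit mask, one scanline at a time.
    fn fill_polygon(poly: &[(f32, f32)], width: usize, height: usize) -> Vec<u8> {
        let mut img = vec![0u8; width * height];
        for y in 0..height {
            let yc = y as f32 + 0.5; // sample at the pixel center
            // Find the x coordinates where edges cross this scanline.
            let mut xs = Vec::new();
            for i in 0..poly.len() {
                let (x0, y0) = poly[i];
                let (x1, y1) = poly[(i + 1) % poly.len()];
                if (y0 <= yc && yc < y1) || (y1 <= yc && yc < y0) {
                    xs.push(x0 + (yc - y0) * (x1 - x0) / (y1 - y0));
                }
            }
            xs.sort_by(|a, b| a.partial_cmp(b).unwrap());
            // Even-odd rule: fill between alternating pairs of crossings.
            for pair in xs.chunks(2) {
                if pair.len() < 2 {
                    continue;
                }
                let start = (pair[0] - 0.5).ceil().max(0.0) as usize;
                let end = (((pair[1] - 0.5).floor() as isize) + 1).clamp(0, width as isize) as usize;
                for x in start..end {
                    img[y * width + x] = 255;
                }
            }
        }
        img
    }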


Thanks for the amazing article! I wonder if you ran into any performance annoyances / bottlenecks when doing actual GUI / game dev with this?


You just need to render your font on the CPU once, and upload mipmaps to the GPU (either signed distance fields, or just sharpened textures, that works absolutely fine too).

I think all this GPU font rendering stuff is a bit of a red herring. Are any popular apps or games actually making heavy use of it?


Enjoyable article, thanks!


I have lots of experience; I've created several font renderers on the CPU and the GPU.

No, CPU drawing is too inefficient. Without too much trouble you get 50x more efficiency on the GPU; that is, you can draw 50 frames for every one frame on the CPU, using the same amount of energy and time.

If you go deeper, low level with Vulkan or Metal, and especially if you control the GPU memory yourself, you can get 200x, though it is way harder to program.

CPU drawing is very useful for testing: You create the reference you compare the GPU stuff with.

CPU drawing is the past, not the future.


Thanks for the write-up. Yeah, I can see the huge performance difference. One thing about the GPU that bothers me is that now every vendor provides a different API; you kinda have to use a hardware abstraction layer if you really want cross-platform, and that's often a huge effort or dependency and hard to do right. Even in the OpenGL days it was easier, because you only had to deal with one API instead of three (Vulkan/Metal/D3D). With the CPU, if you ignore the lack of performance, it's just a plain pixel array that can be easily displayed in any environment, and you have control over every bit of it. I just can't get over the difference in lightness and elegance between the two.


Vulkan/Metal/D3D give you control to do things you could not do with OpenGL, and they are very similar. All my code works first in Vulkan and Metal, but D3D is not hard once you have your Vulkan code working.

OpenGL was designed by committee and politics (like not giving you the option to compile shaders offline for a long time, while D3D could do it).

The hard part is thinking in terms of extreme parallelism. That is super hard.

Once you have something working on the GPU, you can translate it to electronics, e.g. using FPGAs.

The difference in efficiency is so big that most GPU approaches do not really work: they are really approximations that fail with certain glyphs. Designers create a model that works with most glyphs, and sometimes they have an inefficient fallback method for the rest.

With the CPU you can calculate the area under a pixel exactly.


> OpenGL was designed by committee and politics

Vulkan is exactly designed the same way, by the same people.

I've already lost track of the extensions, and of what a joke it is post Vulkan 1.0 to configure everything, to the point that a graphical configuration tool is required to ease the process.


The future is now: on many systems there's fairly little GPU acceleration going on when running e.g. a web browser, and things work fine.


I spent a couple years learning graphics programming to build an iPad app for creating subtle animated effects. The idea was kind of like Procreate but if you had "animated brushes" that produced glimmer or other types of looping animated effects.

From what I've read, the technique behind most digital brushes is to render overlapping "stamps" over the stroke. They're spaced closely enough that you can't actually see the stamps.

But if you want to animate the stamps, you either have to store the stroke data as a very large sequence of stamp meshes or you can only work with the data in a raster format. The former is way too many meshes even with instancing, and the latter loses a lot of useful information about the stroke. Imagine you wanted to create a brush where the edges of the stroke kind of pulsate like a laser beam, you ideally want to store that stroke data in a vector format to make it easier to identify e.g. centers and edges.

But it turned out to be too challenging for me to figure out how to 1) build a vector representation of a stroke/path without losing some of the control over detail you get with the stamping technique and 2) efficiently render those vectors on the GPU.

I'm not sure if this would help with the issues I ran into, but I'm definitely excited to see some focus on 2D rendering improvements!


Is this approach novel? For instance is Apple's approach to native UI rendering doing UI rendering on the CPU, or using a 3D renderer?


Apple has an incredibly fast software 2D renderer (Quartz 2D), and limited GPU 2D renderer and compositor (Quartz Compositor). Doing PostScript rendering on the GPU is still an active research project. And Raph is doing some of that research!


That was in the past.

Apple has one of the best systems for drawing 2D, and it is accelerated on the GPU. It is used by Apple Maps, and it made its offline maps much better than Google's.

But Apple is treating it as a trade secret; they are not publishing it so that everyone could copy it.


Do you have a source for this? I would like to know more.


The author mentions Pathfinder, the GPU font renderer from Servo, a lot, so there do seem to be existing systems that do things that way.

I'm not 100% sure about Apple's and others' approaches, though - definitely when compositing desktop environments were new, the way it was done was to software-render UI elements into textures and then use the GPU to composite it all together. I assume more is being done on the GPU now, but it may not actually be all that performance critical for regular UIs (he talks about things like CAD which are more performance sensitive).


He's describing roughly the feature set of Flash, which is a system for efficiently putting 2D objects on top of other 2D objects.


No, I mean: is the approach of doing this on the GPU actually novel?


Games often use the GPU for their 2D elements. It's inefficient to do a window system that way, because you have to update on every frame, but if you're updating the whole window on every frame anyway, it doesn't add cost. As the original poster points out, it does run down the battery vs. a "window damage" approach.


Yes but games typically use standard rasterization to render 2D elements. My question is whether using compute to "simulate cpu rendering" here is a novel approach.


Depends on what you mean by novel. No other “mainstream” API that implements the traditional 2D imaging model popularized by Warnock et al. with PostScript is implemented this way, except for Pathfinder. Apple does all 2D drawing operations on the CPU and composites distinct layers using the GPU. This does a lot more work on the GPU.


What about Direct2D? Surely Windows counts as mainstream? The docs are from 2018, https://docs.microsoft.com/en-us/windows/win32/direct2d/comp...


> Rendering method

> In order to maintain compatibility, GDI performs a large part of its rendering to aperture memory using the CPU. In contrast, Direct2D translates its API calls into Direct3D primitives and drawing operations. The result is then rendered on the GPU. Some of GDI's rendering is performed on the GPU when the aperture memory is copied to the video memory surface representing the GDI window.

I can’t say for certain but I think the main point being communicated here is that Direct2D uses 3D driver interfaces to get its pixels on the screen. Not necessarily that it renders the image using the GPU. I could be wrong.


What about skia?


Skia renders paths on the CPU. There was a prototype of a GPU-based approach called skia-compute, but it was removed a few years ago. I believe some parts of Skia can use SDFs for font rendering, but that's only really accurate at small sizes.


The skia-compute project is now Spinel, and is under Fuchsia. It is very interesting, perhaps the fastest way to render vector paths on the GPU, but the code is almost completely inscrutable, and it has lots of tuning parameters for specific GPU hardware, so porting is a challenge.

Skia has a requirement that rendering of paths cannot suffer from conflation artifacts (though compositing different paths can), as they don't want to regress on any existing web (SVG, canvas) content. That's made it difficult to move away from their existing software renderer which is highly optimized and deals with this well. Needless to say, I consider that an interesting challenge.


Wow the codebase looks quite small for such ambitions!



