Previous attempts at taking a regular CPU architecture and turning it into a GPU haven't worked so well --- it's what Intel attempted many years ago --- so I'm not too optimistic about how this will turn out. Then again, the boring and uniform nature of RISC-V may be more suitable to mass parallelisation in a GPU than as a CPU.
No, it's not the GPU ISA. Complex chips tend to embed management cores, for example to handle power management or various configuration operations across the chip. NVidia used a proprietary ISA (Falcon) for such cores, and moved to RISC-V. But this is independent from the GPU ISA itself, which remains NVidia proprietary. So in the end, in their embedded SoC you can find (at least) 3 ISAs: ARM for the CPU, NVidia proprietary for the GPU, and RISC-V for NVidia's management cores. And it's likely there are more, for example for the audio DSP and other embedded programmable units.
From what I researched, Falcon is used on nVidia GPUs for everything _but_ the graphics, for example to enforce a security model. But it's a start. They will probably start using it for other purposes as well, for example video encoders and decoders. I very much doubt, though, that they will scuttle their existing GPU architectures for RISC-V yet.
Every attempt to do what other people have failed at benefits from hindsight.
There's been one major success case: A64FX and ARM SVE. The Fujitsu processor just recently took #1 in LINPACK, #1 in Green500 (i.e., the most power-efficient), and #1 in HPCG (an alternative benchmark that GPUs traditionally suck at).
It's not a full GPU yet, but the A64FX is clearly a powerful SIMD compute cluster, very similar to a GPU. It has succeeded where the Xeon Phi seems to have failed.
It's unclear that this particular project has learned from Larrabee hindsight. The graphics part of the GPU doesn't seem to be thought through at all.
That said, if the goal is to compete with existing GPUs for general compute workload, the graphics part is unnecessary anyway. That could work out, though calling it a GPU is a red herring.
And yet in-order blending is so much more power- and area-efficient with fixed function hardware support that, if your goal is to support graphics, especially for a mobile part, you'd be silly not to add it. The same applies to rasterization and a number of other similar things.
Again, this doesn't matter for a pure compute part. It's just not particularly clear from the article whether they're even aware of the distinction.
This doesn't quite seem like that, as the implementation of the instructions isn't yet specified. Even if it is integrated, it can use dedicated VRAM and doesn't have to use the CPU's ALU.
Very cool. It's interesting that they are planning to start with Vulkan support, followed by OpenGL/DX/etc.
I guess it makes sense; the RISC-V crowd might possibly skew towards early adoption over backwards-compatibility.
It would be really cool to see some implementations from places like SiFive/GigaDevice/etc. Imagine how easy driver support could be if everyone used and contributed to the same open IPs...
They can also take advantage of Zink to implement OpenGL, at least on platforms that support Mesa, and use Wine's implementation of Direct3D.
I expect that, in the future, GL and D3D will be implemented entirely on top of Vulkan, and thus be more portable, which will make it easier to build GPUs and graphics drivers.
> I expect that, in the future, GL and D3D will be implemented entirely on top of Vulkan
I'm not sure whether you're talking about the "official" Microsoft D3D libs here, but I very much doubt that'll ever happen. They don't like dependencies they can't control. The Excel team used to maintain their own compiler because they didn't want to be dependent on the Visual C++ team who worked in the same building.
Vulkan isn't even supported in UWP or Win32 sandboxes; the ICD mechanism its drivers inherited from the OpenGL days is only allowed in classical Win32 mode.
OpenGL ES 3.0 is the latest version on iDevices, and while Android can do up to 3.2, it is an optional API.
Likewise, Metal is the name of the game on iOS nowadays, and Vulkan was introduced in Android 7 as an optional API, only becoming mandatory in Android 10.
Vulkan also carries on the Khronos API tradition of extension spaghetti, so while a device might support Vulkan, that doesn't mean it supports the Vulkan that the application actually needs.
It'll be interesting to see if this goes the Larrabee route (a flat array of CPUs which share everything) or the traditional GPU route (multiple levels of shared resources).
You'd have two separate sets of cores. The general-purpose cores would have all the usual extensions you'd expect.
Meanwhile, the GPU cores would be simple cores with huge vector engines. This honestly seems somewhat similar to RDNA at a high level, where you have a large SIMD unit with a scalar core for branching and one-off calculations.
The big payoff is cohesiveness: your RISC-V GPU shares the same memory model as your CPU, which should make integration easier. Likewise, sharing a good permission model can probably help with all those GPU exploits in third-party code (like WebGL).
AVX-style vectors, IMHO, should have been used for all those vec3 and vec4 operations that appear in graphics applications. They should not have been used for general parallelization of loops, although they have had some success in that.
GPUs have geometrically relevant vectors as basic data types. I wish language designers would realize the value in that, but they are too concerned with abstraction.
RISC-V vectors are meant for the abstract general parallel concepts, but could also help with graphics code.
There really seems to be a disconnect around which types of vectors are good for which uses.
>> That is the naive application of SIMD, which many have attempted. It does not lead to a meaningful speedup.
How can it not provide a meaningful speedup? Suppose I have 3- or 4-element vectors a, b, and c, and scalars u and v, and I want to compute:
a = ub + vc;
These should be first-class data types, passed by value, and that line of code should take at most 2 SIMD instructions. It should also not require a fancy vectorizing compiler to do the analysis to find the parallelism, because the data types map directly to the ISA vector registers.
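To make that concrete, here is a minimal C sketch using SSE/FMA intrinsics (the vec4 typedef and the scale_add name are just illustrative, and FMA support is assumed):

    #include <immintrin.h>  /* SSE/FMA intrinsics; compile with -mfma */

    /* Illustrative only: a 4-wide vector as a first-class, by-value type. */
    typedef __m128 vec4;

    /* a = u*b + v*c
       With the type mapped straight onto a 128-bit register this is two
       vector arithmetic instructions (a multiply plus a fused multiply-add),
       besides the scalar broadcasts; no autovectorizer analysis needed. */
    static inline vec4 scale_add(float u, vec4 b, float v, vec4 c) {
        __m128 ub = _mm_mul_ps(_mm_set1_ps(u), b);   /* u*b */
        return _mm_fmadd_ps(_mm_set1_ps(v), c, ub);  /* v*c + u*b */
    }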
By and large, GPUs don't have this anymore, actually (they used to, but this changed many years ago). On modern GPUs your example is "scalarized" and results in separate instructions for each vector component. Of course execution is still "vectorized", but only in the sense of operating on the data for 16/32/64 threads at once.
The reason for this design is that while vec3 and vec4 are common, they are by no means universal in modern shader code. A micro-architecture built on vec3 or vec4 would be under-utilized in the large amounts of shader code that are scalar in the source language.
There are some exceptions, e.g. native support for 16-bit vec2 comes naturally on architectures with 32-bit registers. But those are exceptions.
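A rough, purely illustrative C sketch of what that scalarization means in practice (the WAVE constant, the structure-of-arrays layout, and the function name are made up for illustration, not any real GPU ISA):

    #define WAVE 32  /* threads executing in lockstep (warp/wavefront size) */

    /* The source-level "a = u*b + v*c" on vec3s becomes three separate
       per-component operations; each one is still "vectorized", but across
       the WAVE threads, not across the x/y/z of a single thread's vector. */
    void scalarized_madd(const float u[WAVE], const float v[WAVE],
                         const float bx[WAVE], const float by[WAVE],
                         const float bz[WAVE], const float cx[WAVE],
                         const float cy[WAVE], const float cz[WAVE],
                         float ax[WAVE], float ay[WAVE], float az[WAVE]) {
        for (int t = 0; t < WAVE; ++t) {          /* conceptually all t at once */
            ax[t] = u[t] * bx[t] + v[t] * cx[t];  /* one "instruction" per line, */
            ay[t] = u[t] * by[t] + v[t] * cy[t];  /* executed for every thread   */
            az[t] = u[t] * bz[t] + v[t] * cz[t];  /* in the wave simultaneously  */
        }
    }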
I think SIMD referred to Intel SSE*/AVX/etc., where the vectors have a fixed length. That meant new instructions had to keep being introduced for each new vector width.
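For contrast, here is a minimal sketch of the vector-length-agnostic style the RISC-V vector extension takes, written with the RVV C intrinsics (exact intrinsic names depend on the toolchain's intrinsics version): the same loop runs unchanged on hardware of any vector width, so no new instructions are needed when wider implementations appear.

    #include <stddef.h>
    #include <riscv_vector.h>  /* RVV C intrinsics, v1.0-style naming assumed */

    /* Vector-length-agnostic saxpy: y[i] = a*x[i] + y[i].
       vsetvl asks the hardware how many 32-bit elements fit per iteration,
       so the same binary scales from narrow to wide vector units, unlike
       SSE -> AVX -> AVX-512 where each width brings new instructions. */
    void saxpy(size_t n, float a, const float *x, float *y) {
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m8(n);             /* elements this pass */
            vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);  /* load x chunk */
            vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);  /* load y chunk */
            vy = __riscv_vfmacc_vf_f32m8(vy, a, vx, vl);     /* y += a * x */
            __riscv_vse32_v_f32m8(y, vy, vl);                /* store y chunk */
            n -= vl; x += vl; y += vl;
        }
    }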
It's the whole full-core-on-a-GPU aspect, which probably isn't a bad idea for a few reasons; you could even offload much of the driver onto it and, with that, make platform drivers much more manageable... maybe.
The article doesn't mention mixed-precision machine learning. Will it be available? I would really love to see a RISC-V laptop using vector extensions beating the Apple M1 core in everything (though I know it will take a lot of time to get to a 5nm process node).
It mentions support for INT8 for ML/AI. FP16 and BF16 support isn't explicitly mentioned, but it does support 8/16/32-bit fixed/float scalars and vectors, so it could be added later.
I think an FP16 or BF16 ALU could be added sooner rather than later, but getting something equivalent to the performance of, say, an Nvidia 1050 or 1080 (the first consumer cards from Nvidia to have FP16 and INT8, I think) would take a couple of iterations.
Cheap phones running Android would be a great start for RV64/RV64X though.
I wonder what company would put in the hundreds of millions to design a RISC-V processor on 5nm? I’m not being facetious — I wonder if there will be a future where we see that! It would be super exciting.
Chinese companies will take that step for sure once China manages to bring domestic advanced nodes into production. It doesn't make sense to have domestic semiconductors and then use a proprietary ISA like ARM or x86(_64) again.
Naive question, but can you not target just one use case (ML, graphics, etc.) to begin with? Something like a TPU? Why does it need to support every use case of a GPU in the initial design?
- Focusing on one use case might make it difficult to generalize the design later. Lessons learned might not carry over, and the platform will get anchored into particular design choices. This only gets worse the more the solution gets deployed.
- Having a product that can do both makes it simpler to do applications that combine multiple computational loads. Else, the data has to be transferred to another device. This seems unattractive since graphics, ML and massively parallel scientific computations are more similar to each other than to what CPUs are used for.
- Development budget gets fragmented, which is a disadvantage since the different use cases might not correspond to markets that are large enough to support each product.
- One of the platforms might just die off. Users that use both would be stuck in between and would have to find a solution, and most likely it will be NVidia/AMD again.
After Apple setting a precedent with on-package RAM, I fear that in the future upgradeable RAM might become rare on everything but high-end workstations.
I hope this will not happen, but certain laptop companies in particular always wanted non-upgradeable RAM and only stepped back because users were too unhappy about it. Now they will point to Apple and tell you that's necessary for fast, low-power laptops with long battery life.
Well, given that Apple doesn't appear to be gaining a latency advantage from their on-package RAM, the jury seems to still be out on that theory.
OTOH, the trend has been to put HBM/eDRAM/etc. on package for a while. Various Intel (and other) products have done that, which provides a fast tier of memory that you put in a local NUMA domain, with the actual DRAM in its own domain.
The problem so far is that this strategy doesn't actually work well for most applications, because they aren't sufficiently NUMA-aware to take advantage of it. Instead, the "fast" RAM ends up being used as an LLC tier.
It all ends up sort of being a physics thing, though: as you increase capacity, distance tends to grow as well, meaning there will be a limit to how much RAM they can put on package and still gain a latency advantage over just soldering it to the board.
Do you think we could extend the memory of most 8 and 16 bit home computers?
If we wanted to upgrade our computers, beyond plugging stuff on the external expansion port, we had to buy a new model.
The PC was the precedent, and only because IBM lost control of it.
Now with laptops and tablets becoming the standard consumer computers, we are back to the '80s/'90s computer form factors, just that instead of plugging them into the TV, the screen comes along for the ride.
And actually it was great, because it meant developers had to learn to extract all the juice of the computers people had, instead of expecting us to spend money upgrading.
> Do you think we could extend the memory of most 8 and 16 bit home computers?
I'm not sure how you're defining "most" home computers, but the C64/128 and Apple ][ lines were some of the most popular, and they definitely had RAM upgrades.
The soldered-laptop-RAM thing is fairly recent and by no means universal. I suspect it would be rarer than it is if retailers put a little note on the machine descriptions ("upgradeable RAM/upgradeable disk"), but they can't even be bothered to put the CPU clock rates (or sometimes even the core count) in the machine descriptions.
Well, RAM upgrades on those machines didn't just plug into slots. Various C64 upgrades were replacement RAM chips, or piggybacked on the existing chips. Same for the II line, although frequently there were RAM upgrade slots (like on the IIGS, for example). Beyond that, most of these machines had the RAM and slots in the same clock domain (aka a "local bus"), so a plug-in card's RAM basically ran at the same rate as the onboard RAM. Of course, there were often banking issues on the 8-bit machines, but that happened regardless of where the RAM was located: things like the 128K IIc (a machine without slots) were banked from the factory, and various mod-level upgrades replaced various chips on the board, until the later revisions when Apple provided a sanctioned mechanism for upgrading the machine to 1MB.
The name is slightly reminiscent of https://en.wikipedia.org/wiki/File:RIVA_TNT2_VANTA_GPU.jpg