Previous attempts at taking a regular CPU architecture and turning it into a GPU haven't worked so well --- it's what Intel attempted many years ago --- so I'm not too optimistic about how this will turn out. Then again, the boring and uniform nature of RISC-V may be more suitable to mass parallelisation in a GPU than as a CPU.
No, it's not the GPU ISA. Complex chips tend to embed management cores, for example to handle power management or various configuration operations across the chip. NVidia used a proprietary ISA (Falcon) for such cores, and moved to RISC-V. But this is independent from the GPU ISA itself, which remains NVidia proprietary. So in the end, in their embedded SoC you can find (at least) 3 ISAs: ARM for the CPU, NVidia proprietary for the GPU, and RISC-V for NVidia's management cores. And it's likely there are more, for example for the audio DSP and other embedded programmable units.
From what I researched, Falcon is used on nVidia GPUs for everything _but_ the graphics, for example to enforce a security model. But it's a start. They will probably start using it for other purposes as well, for example video encoders and decoders. I very much doubt, though, that they will scuttle their existing GPU architectures for RISC-V yet.
Every attempt to do what other people have failed at benefits from hindsight.
There's been one major success case: A64FX and ARM SVE. The Fujitsu processor just recently took #1 in LINPACK, #1 in Green500 (i.e., the most power-efficient), and #1 in HPCG (an alternative benchmark that GPUs traditionally suck at).
It's not a full GPU yet, but the A64FX is clearly a powerful SIMD compute cluster, very similar to a GPU. It has succeeded where the Xeon Phi seems to have failed.
It's unclear that this particular project has learned from Larrabee hindsight. The graphics part of the GPU doesn't seem to be thought through at all.
That said, if the goal is to compete with existing GPUs for general compute workload, the graphics part is unnecessary anyway. That could work out, though calling it a GPU is a red herring.
And yet in-order blending is so much more power- and area-efficient with fixed function hardware support that, if your goal is to support graphics, especially for a mobile part, you'd be silly not to add it. The same applies to rasterization and a number of other similar things.
Again, this doesn't matter for a pure compute part. It's just not particularly clear from the article whether they're even aware of the distinction.
This doesn't quite seem like that, as the implementation of the instructions isn't yet specified. Even if it is integrated, it can use dedicated VRAM and doesn't have to use the CPU's ALU.
Very cool. It's interesting that they are planning to start with Vulkan support, followed by OpenGL/DX/etc.
I guess it makes sense; the RISC-V crowd might possibly skew towards early adoption over backwards-compatibility.
It would be really cool to see some implementations from places like SiFive/GigaDevice/etc. Imagine how easy driver support could be if everyone used and contributed to the same open IPs...
They can also take advantage of Zink to implement OpenGL, at least on platforms that support Mesa, and use Wine's implementation of Direct3D.
I expect that, in the future, GL and D3D will be implemented entirely on top of Vulkan, and thus be more portable, which will make it easier to build GPUs and graphics drivers.
> I expect that, in the future, GL and D3D will be implemented entirely on top of Vulkan
I'm not sure whether you're talking about the "official" Microsoft D3D libs here, but I very much doubt that'll ever happen. They don't like dependencies they can't control. The Excel team used to maintain their own compiler because they didn't want to be dependent on the Visual C++ team who worked in the same building.
Vulkan isn't even supported in UWP or Win32 sandboxes; the ICD mechanism its drivers inherited from the OpenGL days is only allowed in classical Win32 mode.
OpenGL ES 3.0 is the latest version on iDevices, and while Android can do up to 3.2, it is an optional API.
Likewise, Metal is the name of the game on iOS nowadays, and Vulkan was introduced in Android 7 as an optional API, only becoming mandatory in Android 10.
Vulkan also carries on the Khronos API tradition of extension spaghetti, so while a device might support Vulkan, that doesn't mean it supports the Vulkan that the application actually needs.
It'll be interesting to see if this goes the Larrabee route (a flat array of CPUs which share everything) or the traditional GPU route (multiple levels of shared resources).
You'd have two separate sets of cores. The general-purpose cores would have all the usual extensions you'd expect.
Meanwhile, the GPU cores would be simple cores with huge vector engines. This honestly seems somewhat similar to RDNA at a high level, where you have a large SIMD unit with a scalar core for branching and one-off calculations.
The big payoff is cohesiveness: your RISC-V GPU shares the same memory model as your CPU, which should make integration easier. Likewise, sharing a good permission model can probably help with all those GPU exploits in third-party code (like WebGL).
AVX-style vectors, IMHO, should have been used for all those vec3 and vec4 operations that appear in graphics applications. They should not have been used for general parallelization of loops, although they have had some success in that.
GPUs have geometrically relevant vectors as basic data types. I wish language designers would realize the value in that, but they are too concerned with abstraction.
RISC-V vectors are meant for the abstract general parallel concepts, but could also help with graphics code.
There really seems to be a disconnect around which types of vectors are good for which uses.
>> That is the naive application of SIMD, which many have attempted. It does not lead to a meaningful speedup.
How can it not provide a meaningful speedup? Suppose I have 3- or 4-element vectors a, b, and c, and scalars u and v, and I want to compute:
a = ub + vc;
These should be first-class data types, passed by value, and that line of code should take at most 2 SIMD instructions. It should also not require a fancy vectorizing compiler to do the analysis to find the parallelism, because the data types map directly to the ISA vector registers.
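To make that concrete, here is a minimal C sketch using SSE/FMA intrinsics (the vec4 typedef and the scale_add name are just illustrative, and FMA support is assumed):

    #include <immintrin.h>  /* SSE/FMA intrinsics; compile with -mfma */

    /* Illustrative only: a 4-wide vector as a first-class, by-value type. */
    typedef __m128 vec4;

    /* a = u*b + v*c
       With the type mapped straight onto a 128-bit register this is two
       vector arithmetic instructions (a multiply plus a fused multiply-add),
       besides the scalar broadcasts; no autovectorizer analysis needed. */
    static inline vec4 scale_add(float u, vec4 b, float v, vec4 c) {
        __m128 ub = _mm_mul_ps(_mm_set1_ps(u), b);   /* u*b */
        return _mm_fmadd_ps(_mm_set1_ps(v), c, ub);  /* v*c + u*b */
    }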
By and large, GPUs don't have this anymore, actually (they used to, but this changed many years ago). On modern GPUs your example is "scalarized" and results in separate instructions for each vector component. Of course execution is still "vectorized", but only in the sense of operating on the data for 16/32/64 threads at once.
The reason for this design is that while vec3 and vec4 are common, they are by no means universal in modern shader code. A micro-architecture built on vec3 or vec4 would be under-utilized in the large amounts of shader code that are scalar in the source language.
There are some exceptions, e.g. native support for 16-bit vec2 comes naturally on architectures with 32-bit registers. But those are exceptions.
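A rough, purely illustrative C sketch of what that scalarization means in practice (the WAVE constant, the structure-of-arrays layout, and the function name are made up for illustration, not any real GPU ISA):

    #define WAVE 32  /* threads executing in lockstep (warp/wavefront size) */

    /* The source-level "a = u*b + v*c" on vec3s becomes three separate
       per-component operations; each one is still "vectorized", but across
       the WAVE threads, not across the x/y/z of a single thread's vector. */
    void scalarized_madd(const float u[WAVE], const float v[WAVE],
                         const float bx[WAVE], const float by[WAVE],
                         const float bz[WAVE], const float cx[WAVE],
                         const float cy[WAVE], const float cz[WAVE],
                         float ax[WAVE], float ay[WAVE], float az[WAVE]) {
        for (int t = 0; t < WAVE; ++t) {          /* conceptually all t at once */
            ax[t] = u[t] * bx[t] + v[t] * cx[t];  /* one "instruction" per line, */
            ay[t] = u[t] * by[t] + v[t] * cy[t];  /* executed for every thread   */
            az[t] = u[t] * bz[t] + v[t] * cz[t];  /* in the wave simultaneously  */
        }
    }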
I think SIMD referred to Intel SSE*/AVX/etc., where the vectors have a fixed length. That meant new instructions had to keep being introduced for each new vector width.
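For contrast, here is a minimal sketch of the vector-length-agnostic style the RISC-V vector extension takes, written with the RVV C intrinsics (exact intrinsic names depend on the toolchain's intrinsics version): the same loop runs unchanged on hardware of any vector width, so no new instructions are needed when wider implementations appear.

    #include <stddef.h>
    #include <riscv_vector.h>  /* RVV C intrinsics, v1.0-style naming assumed */

    /* Vector-length-agnostic saxpy: y[i] = a*x[i] + y[i].
       vsetvl asks the hardware how many 32-bit elements fit per iteration,
       so the same binary scales from narrow to wide vector units, unlike
       SSE -> AVX -> AVX-512 where each width brings new instructions. */
    void saxpy(size_t n, float a, const float *x, float *y) {
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m8(n);             /* elements this pass */
            vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);  /* load x chunk */
            vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);  /* load y chunk */
            vy = __riscv_vfmacc_vf_f32m8(vy, a, vx, vl);     /* y += a * x */
            __riscv_vse32_v_f32m8(y, vy, vl);                /* store y chunk */
            n -= vl; x += vl; y += vl;
        }
    }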
It's the whole full-core-on-a-GPU aspect, which probably isn't a bad idea for a few reasons; you could even offload much of the driver onto it and, with that, make platform drivers much more manageable... maybe.
The article doesn't mention mixed-precision machine learning. Will it be available? I would really love to see a RISC-V laptop using vector extensions beating the Apple M1 core in everything (though I know it will take a lot of time to get to a 5nm process node).
It mentions support for INT8 for ML/AI. FP16 and BF16 support isn't explicitly mentioned, but it does support 8/16/32-bit fixed/float scalars and vectors, so it could be added later.
I think an FP16 or BF16 ALU could be added sooner rather than later, but getting something equivalent to the performance of, say, an Nvidia 1050 or 1080 (the first consumer cards from Nvidia to have FP16 and INT8, I think) would take a couple of iterations.
Cheap phones running Android would be a great start for RV64/RV64X though.
I wonder what company would put in the hundreds of millions to design a RISC-V processor on 5nm? I’m not being facetious — I wonder if there will be a future where we see that! It would be super exciting.
Chinese companies will take that step for sure once China manages to bring domestic advanced nodes into production. It doesn't make sense to have domestic semiconductors and then use a proprietary ISA like ARM or x86(_64) again.
Naive question, but can you not target just one use case (ML, graphics, etc.) to begin with? Something like a TPU? Why does it need to support every use case of a GPU in the initial design?
- Focusing on one use case might make it difficult to generalize the design later. Lessons learned might not carry over, and the platform will get anchored into particular design choices. This only gets worse the more the solution gets deployed.
- Having a product that can do both makes it simpler to do applications that combine multiple computational loads. Else, the data has to be transferred to another device. This seems unattractive since graphics, ML and massively parallel scientific computations are more similar to each other than to what CPUs are used for.
- Development budget gets fragmented, which is a disadvantage since the different use cases might not correspond to markets that are large enough to support each product.
- One of the platforms might just die off. Users that use both would be stuck in between and would have to find a solution, and most likely it will be NVidia/AMD again.
After Apple setting a precedent with on-package RAM, I fear that in the future upgradeable RAM might become rare on everything but high-end workstations.
I hope this will not happen, but certain laptop companies in particular always wanted non-upgradeable RAM and only stepped back because users were too unhappy about it. Now they will point to Apple and tell you that's necessary for fast, low-power laptops with long battery life.
Well, given that Apple doesn't appear to be gaining a latency advantage from their on-package RAM, the jury seems to still be out on that theory.
OTOH, the trend has been to put HBM/eDRAM/etc. on package for a while. Various Intel (and other) products have done that, which provides a fast tier of memory that you put in a local NUMA domain, with the actual DRAM in its own domain.
The problem so far is that this strategy doesn't actually work well for most applications, because they aren't sufficiently NUMA-aware to take advantage of it. Instead, the "fast" RAM ends up being used as an LLC tier.
It all ends up sort of being a physics thing, though: as you increase capacity, distance tends to grow as well, meaning there will be a limit to how much RAM they can put on package and still gain a latency advantage over just soldering it to the board.
Do you think we could extend the memory of most 8 and 16 bit home computers?
If we wanted to upgrade our computers, beyond plugging stuff on the external expansion port, we had to buy a new model.
The PC was the precedent, and only because IBM lost control of it.
Now with laptops and tablets becoming the standard consumer computers, we are back to the '80s/'90s computer form factors, just that instead of plugging them into the TV, the screen comes along for the ride.
And actually it was great, because it meant developers had to learn to extract all the juice of the computers people had, instead of expecting us to spend money upgrading.
> Do you think we could extend the memory of most 8 and 16 bit home computers?
I'm not sure how you're defining "most" home computers, but the C64/128 and Apple ][ lines were some of the most popular, and they definitely had RAM upgrades.
The soldered-laptop-RAM thing is fairly recent and by no means universal. I suspect it would be rarer than it is if retailers put a little note on the machine descriptions ("upgradeable RAM/upgradeable disk"), but they can't even be bothered to put the CPU clock rates (or sometimes even the core count) in the machine descriptions.
Well, RAM upgrades on those machines didn't just plug into slots. Various C64 upgrades were replacement RAM chips, or piggybacked on the existing chips. Same for the II line, although frequently there were RAM upgrade slots (like on the IIGS, for example). Beyond that, most of these machines had the RAM and slots in the same clock domain (aka a "local bus"), so a plug-in card's RAM basically ran at the same rate as the onboard RAM. Of course, there were often banking issues on the 8-bit machines, but that happened regardless of where the RAM was located: things like the 128K IIc (a machine without slots) were banked from the factory, and various mod-level upgrades replaced various chips on the board, until the later revisions when Apple provided a sanctioned mechanism for upgrading the machine to 1MB.
The name is slightly reminiscent of https://en.wikipedia.org/wiki/File:RIVA_TNT2_VANTA_GPU.jpg