> And yet the current RISC-V approach is not as good as MXP
I watched that presentation a while ago, and while the figures shown look nice, I suspect the crux is whether MXP is practically implementable at all. I'm not at all an expert on this topic, so take this with a large grain of salt. Anyway:
1) With MXP, instead of a vector register file you have a scratchpad memory, i.e. a chunk of byte-addressable memory in the CPU. Now, if you want multiple vector ALUs (lanes), that scratchpad needs to be multi-ported, which quickly starts to eat up a lot of area and power. In contrast, a vector regfile can be split into single-ported per-lane chunks, saving power and area.
2) MXP seems to depend on its shuffling engines to align the data and feed it to the ALUs. What's the overhead of those? It seems far from trivial.
As for other potential models, I have to admit I'm not entirely convinced by their dismissal of the SIMT-style model. Sure, it needs a bit more micro-architectural state, basically a program counter per vector lane. But there's also a certain simplicity to the model: no need for separate vector instructions, and for the really basic stuff you only need fork/join-type instructions to switch from scalar execution to SIMT and back. And there's no denying that SIMT has been an extremely successful programming model in the GPU world.
> It's also not fair to compare instructions executed between SIMD and some huge vector register implementation. Most common RISC-V CPUs are likely to have smaller vector register from 256 to 512 bits wide.
True; the more interesting part is the overhead. Does your ISA require vector loads/stores to be aligned on a vector-size boundary? Then when vectorizing a loop you need a scalar pre-loop to handle the first elements until you reach the right alignment and can use the vectorized code. Similarly, how do you handle the tail of the loop if the number of elements is not a multiple of the vector length? If you don't have a vector length register or similar, you need a separate tail loop. And is the data in memory contiguous? Without scatter/gather and strided loads/stores you have to choose between not vectorizing and repacking the data.
That bloats the code and is one of the reasons why autovectorizing for SIMD ISAs is difficult for compilers: often the compiler doesn't know how many iterations a loop will execute, and because of the above, a large number of iterations is needed to amortize the overhead. With a "proper" vector ISA the overhead is very small and it's profitable to vectorize every loop the compiler is able to.