Yes, I've seen it. It's quite neat. On the other hand it seems more complex to implement in hardware.
In the simplest possible design, the cost of vector support in the MRISC32 is essentially the cost of the register file (i.e. the register memory), since the vector control logic just consists of a few adders and flip-flops, and the scalar execution units can be reused for vector operations.
I was under the impression that we split the register file between the floating-point and integer pipelines because the extra muxes between the register file and the execution units, as well as on the bypass network, really cut into the critical path. It's only on vector units, with their long pipelines, that we don't really care.
Yeah, but a modern processor renames registers and has perhaps hundreds of physical ones, so I'm guessing this might be historical. Also, it should be noted that there is some strong coupling between floating-point/vector operations and general-purpose registers like EAX (_mm_movemask_epi8, which returns its mask in an integer register, is one example).
They still keep the physical register files split between integer and floating/vector, despite the renaming that happens on each of the pipes. Ironically enough, the integer file has more physical registers than the vector file on most uarchs (although fewer bits).
If your processor uses register renaming, the frequency of your core limits how many physical registers you can have in the register file. Therefore, you are incentivized to split integer and floating point register files so you can size up each register file to its limit.
You can also play other games, like internally recoding the FP format, if you split the register files.
On the other hand, you can remove the need for IntToFP and FPToInt move instructions if you use one register file.
At the end of the day though, I suspect it largely comes down to having more addressable architectural state to keep the two register files separate. Why have only 32 scalar registers when you can have 32 scalar int and 32 scalar fp registers?
I suspect that what you say is true. One thing worth considering though is that on x86_64 you actually use the vector register file for floating point (and it can do integer too!).
On the MRISC32 you have a vector register file with 32 registers that can easily be configured to do scalar operations (integer or floating point) by setting the vector length to 1.
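To illustrate the "vector length of 1" point, here is a minimal software model in C. It is only a sketch: `MAX_VL`, `vreg_t` and `vadd` are hypothetical names made up for this example, and real hardware would of course not run a C loop. The point is just that the same vector operation degenerates to a scalar one when the vector length is 1.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_VL 16  /* hypothetical number of elements per vector register */

typedef struct {
    uint32_t elem[MAX_VL];
} vreg_t;

/* Element-wise add of the first vl elements of a and b.
 * Elements at index >= vl are left untouched in dst. */
static void vadd(vreg_t *dst, const vreg_t *a, const vreg_t *b, size_t vl)
{
    if (vl > MAX_VL) {
        vl = MAX_VL;  /* clamp to the register size */
    }
    for (size_t i = 0; i < vl; i++) {
        dst->elem[i] = a->elem[i] + b->elem[i];
    }
}
```

Calling `vadd(&d, &a, &b, 1)` only touches element 0, which is effectively a scalar add on the vector register file.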
Have you looked at the Convex C-Series architecture? They copied the Cray vector idea. Starting with the C2 (I think) there was also a Vector Mask (VM) register which described which vector elements had valid data. Since the C-Series (and the Cray, I think) processed vectors serially, vector ops using the VM would only spend time on the valid elements. It was targeted at codes like this:
for (i = 0; i < n; i++) {
    if (A[i] < 34) {
        D[i] = A[i] * B[i] + C[i];
    }
}
That would translate into code like this. (Actual instructions and mnemonics lost to the mists of time; this is pseudo-assembler.)
ld       vl, #N         ; assume N <= 128 (-:
ld.v     v0, A          ; load A
ld       s0, #34        ; comparison threshold from the C loop
less.v   v0, s0         ; stores boolean vector (A < 34) into VM
ld.v.t   v1, B          ; load B where mask == true
mul.v.t  v3, v1, v0     ; calc A*B under mask, store in v3
ld.v.t   v2, C          ; load C under mask
add.v.t  v3, v3, v2     ; calc A*B+C under mask, store in v3
st.v.t   v3, D          ; store result to D under mask
I am ignoring the difference between I32, I64, F32 and F64 because I don't remember how those were coded into the mnemonics, sorry.
There were also instructions to load and store the VM to a scalar register pair.
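For anyone who wants to play with the idea, the masked sequence above can be modelled in plain C roughly as follows. This is a sketch under assumptions, not Convex semantics: `vmask_t`, `vless` and `masked_muladd_store` are hypothetical names, the 128-bit mask held as two 64-bit words is my guess at what a "scalar register pair" amounts to, and everything is treated as 32-bit integers.

```c
#include <stddef.h>
#include <stdint.h>

#define VLEN_MAX 128  /* assumed maximum vector length (the "N <= 128" above) */

/* 128 mask bits modelled as a pair of 64-bit words (an assumption). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} vmask_t;

static int mask_test(const vmask_t *m, size_t i)
{
    uint64_t word = (i < 64) ? m->lo : m->hi;
    return (int)((word >> (i % 64)) & 1u);
}

/* Models less.v: set mask bit i where a[i] < limit. */
static vmask_t vless(const int32_t *a, int32_t limit, size_t vl)
{
    vmask_t m = {0, 0};
    if (vl > VLEN_MAX) {
        vl = VLEN_MAX;
    }
    for (size_t i = 0; i < vl; i++) {
        if (a[i] < limit) {
            if (i < 64) {
                m.lo |= (uint64_t)1 << i;
            } else {
                m.hi |= (uint64_t)1 << (i - 64);
            }
        }
    }
    return m;
}

/* Models the ld.v.t / mul.v.t / add.v.t / st.v.t sequence:
 * only elements whose mask bit is set are computed and stored. */
static void masked_muladd_store(int32_t *d, const int32_t *a,
                                const int32_t *b, const int32_t *c,
                                const vmask_t *m, size_t vl)
{
    if (vl > VLEN_MAX) {
        vl = VLEN_MAX;
    }
    for (size_t i = 0; i < vl; i++) {
        if (mask_test(m, i)) {
            d[i] = a[i] * b[i] + c[i];
        }
    }
}
```

On a serial vector machine the loop would skip the masked-off elements entirely, which is where the time saving comes from.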
The Mill architecture, if I understood the lectures, only has a masked store instruction; other vector instructions calculate the values for all elements and maintain an "invalid result" bitvector. An exception is only triggered if an invalid result is actually stored.
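As a rough illustration of that "compute everything, fault only on store" idea, here is a small C sketch. It is not the Mill's actual mechanism or encoding: `LANES`, `vec_t`, `vdiv` and `vstore_masked` are hypothetical, division stands in for any operation that can produce an invalid result, and a return value of -1 stands in for the exception.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* hypothetical vector width */

typedef struct {
    int32_t  val[LANES];
    uint32_t invalid;  /* bit i set => lane i holds an invalid result */
} vec_t;

/* Divide all lanes unconditionally. A lane that cannot be computed
 * (zero divisor, or INT32_MIN / -1 overflow) is marked invalid
 * instead of trapping. */
static vec_t vdiv(const int32_t *a, const int32_t *b)
{
    vec_t r = { {0}, 0 };
    for (size_t i = 0; i < LANES; i++) {
        if (b[i] == 0 || (a[i] == INT32_MIN && b[i] == -1)) {
            r.invalid |= 1u << i;
        } else {
            r.val[i] = a[i] / b[i];
        }
    }
    return r;
}

/* Masked store. Returns 0 on success. Returns -1 (standing in for an
 * exception) if any lane selected by the mask is invalid; nothing is
 * written in that case. */
static int vstore_masked(int32_t *dst, const vec_t *v, uint32_t mask)
{
    mask &= (1u << LANES) - 1u;  /* ignore bits beyond the vector width */
    if (v->invalid & mask) {
        return -1;
    }
    for (size_t i = 0; i < LANES; i++) {
        if ((mask >> i) & 1u) {
            dst[i] = v->val[i];
        }
    }
    return 0;
}
```

An invalid lane is harmless as long as the store mask excludes it, which is what lets the intermediate operations run unmasked.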
Thanks for the reference - I have not heard about the Convex C-Series before so I'll be sure to look it up.
While the MRISC32-A1 does serial vector processing, the idea is that you should be able to do parallel processing with the same ISA, so I'm not certain that masking in that fashion is a good option for the MRISC32. Also, having a vector mask register like the Cray limits the size of the vector registers (e.g. to 32 elements in the MRISC32, or 64 elements in a 64-bit architecture).
The Mill is far ahead of everybody else, and even vectorizes regular for loops. It's a most beautiful architecture; what it needs is an implementation, and for its designers to get out of their bubble. Intel and the Mill are destined for each other, but they don't seem to know how to join forces.
Indeed. Intel sees the Mill as insignificant, and itself as the gorilla in the room. But in terms of the architecture, the Mill is on the order of 10x better. That's incredible. Since Intel tends to rely on strong-arming, it is the Mill's #1 enemy for as long as the Mill has to go it alone. But it should definitely not be so.
Excellent project; I hope it will keep developing in the future. We need projects like these to break free from the grip of certain vendors.
This is also an interesting project, based on expired patents on the SuperH CPU: j-core.org. They also have a nice and informative video presentation on YouTube.
This is very cool!
Any idea what sort of gate count we are looking at for synthesizing the VHDL?
I'm not familiar with the Intel FPGA you refer to in the docs for the VHDL implementation:
I'm still learning FPGAs myself, so I'm not sure if I can give you a relevant figure. Currently the A1 pipeline (no memory subsystem) synthesises to 2000-3000 ALMs (or about 15% of the logic in the device I'm targeting). That will increase as I implement more instructions (the FPU in particular), but on the other hand it will probably decrease when I switch to using BRAM for the scalar register file.
https://pdfs.semanticscholar.org/b9e8/fcf11b662a31cecd5a08d5...
https://www.sigarch.org/simd-instructions-considered-harmful...