The MRISC32 – A vector first CPU design

jrk · on Aug 25, 2018

The RISC-V V extension follows a Cray-style model, and builds on the Hwacha work to handle mixed-precision vectors much more cleanly:

https://pdfs.semanticscholar.org/b9e8/fcf11b662a31cecd5a08d5... https://www.sigarch.org/simd-instructions-considered-harmful...

mbitsnbites · on Aug 25, 2018

Yes, I've seen it. It's quite neat. On the other hand it seems more complex to implement in hardware.

In the simplest possible design, the cost of vector support in the MRISC32 is essentially the cost of the register file (i.e. the register memory), since the vector control logic just consists of a few adders and flip-flops, and the scalar execution units can be reused for vector operations.

petermcneeley · on Aug 24, 2018

There are similarities between this and the PS3 SPE processors. The SPEs had 128 bit registers that could be interchangeably used for various sized and various typed operations. https://en.wikipedia.org/w/index.php?title=Cell_(microproces...

monocasa · on Aug 24, 2018

I was under the impression that we split the register file between the floating-point and integer pipelines because the extra muxes between the register file and the execution units, as well as on the bypass network, really cut into the critical path. It's only on vector units with their long pipelines that we don't really care.

petermcneeley · on Aug 24, 2018

Ya, but a modern processor has virtual registers, perhaps hundreds of them, so I'm guessing this might be historical. It should also be noted that there is a strong connection between floating-point operations and implicit registers like EAX (_mm_movemask_epi8 is one example).

monocasa · on Aug 24, 2018

They still keep the physical register files split between integer and floating/vector, despite the renaming that happens on each of the pipes. Ironically enough, the integer file has more physical registers than the vector file on most uarchs (although fewer bits).

_chris_ · on Aug 25, 2018

If your processor uses register renaming, the frequency of your core limits how many physical registers you can have in the register file. Therefore, you are incentivized to split integer and floating point register files so you can size up each register file to its limit.

You can also play other games, like internally recoding the FP format, if you split the register files.

On the other hand, you can remove the need for IntToFP and FPToInt move instructions if you use one register file.

At the end of the day though, I suspect it largely comes down to having more addressable architectural state to keep the two register files separate. Why have only 32 scalar registers when you can have 32 scalar int and 32 scalar fp registers?

mbitsnbites · on Aug 25, 2018

I suspect that what you say is true. One thing worth considering though is that on x86_64 you actually use the vector register file for floating point (and it can do integer too!).

On the MRISC32 you have a vector register file with 32 registers that can easily be configured to do scalar operations (integer or floating point) by setting the vector length to 1.
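
As a rough illustration of the vector-length idea, here is a minimal C sketch of a vector add driven by a vector length. It is not taken from the MRISC32 sources: the names `VREG_ELEMS`, `vreg_t` and `vec_add`, and the register size of 16 elements, are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed number of elements per vector register (hypothetical value). */
#define VREG_ELEMS 16

typedef struct {
    uint32_t e[VREG_ELEMS];
} vreg_t;

/* Add the first vl elements of a and b into dst, one element per step,
 * the way a serial implementation could feed a single scalar adder.
 * Elements at index >= vl are left untouched. */
void vec_add(vreg_t *dst, const vreg_t *a, const vreg_t *b, unsigned vl)
{
    assert(vl <= VREG_ELEMS);
    for (unsigned i = 0; i < vl; i++) {
        dst->e[i] = a->e[i] + b->e[i];
    }
}
```

Calling `vec_add(&d, &a, &b, 1)` only touches element 0, so the same operation behaves like a scalar add on the first element of each register.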

kbob · on Aug 25, 2018

Great project.

Have you looked at the Convex C-Series architecture? They copied the Cray vector idea. Starting with the C2 (I think) there was also a Vector Mask (VM) register which described which vector elements had valid data. Since the C-Series (and the Cray, I think) processed vectors serially, vector ops using the VM would only spend time on the valid elements. It was targeted at codes like this:

        for (i = 0; i < n; i++) {
            if (A[i] < 34) {
                D[i] = A[i] * B[i] + C[i];
            }
        }

That would translate into code like this. (Actual instructions and mnemonics lost to the mists of time; this is pseudo-assembler.)

        ld       vl, #N      ; assume N <= 128 (-:
        ld.v     v0, A       ; load A
        ld       s0, #34     ; compare limit from the C loop (A[i] < 34)
        less.v   v0, s0      ; stores boolean vector into VM
        ld.v.t   v1, B       ; load B where mask == true
        mul.v.t  v3, v1, v0  ; calc A*B under mask, store in v3
        ld.v.t   v2, C       ; load C under mask
        add.v.t  v3, v3, v2  ; calc A*B+C under mask, store in v3
        st.v.t   v3, D

I am ignoring the difference between I32, I64, F32 and F64 because I don't remember how those were coded into the mnemonics, sorry.

There were also instructions to load and store the VM to a scalar register pair.
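
For readers who prefer C to pseudo-assembler, the masked sequence can be emulated roughly as below. This is only a sketch of the semantics described here, not of the real Convex instructions: `vm_less`, `muladd_under_mask` and the bool-array representation of the VM register are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_VL 128  /* same "N <= 128" assumption as the listing */

/* Compare step: vm[i] = (a[i] < limit), standing in for less.v writing VM. */
void vm_less(bool *vm, const int *a, int limit, size_t vl)
{
    assert(vl <= MAX_VL);
    for (size_t i = 0; i < vl; i++) {
        vm[i] = a[i] < limit;
    }
}

/* Masked multiply-add and store: d[i] = a[i] * b[i] + c[i] where vm[i] is set.
 * Returns how many elements were actually processed, which is the work a
 * serial machine would spend when it skips the masked-off elements. */
size_t muladd_under_mask(int *d, const int *a, const int *b, const int *c,
                         const bool *vm, size_t vl)
{
    assert(vl <= MAX_VL);
    size_t work = 0;
    for (size_t i = 0; i < vl; i++) {
        if (!vm[i]) {
            continue;  /* masked off: no load, no multiply, no store */
        }
        d[i] = a[i] * b[i] + c[i];
        work++;
    }
    return work;
}
```

With A = {1, 40, 33, 34} and a limit of 34, only elements 0 and 2 pass the compare, so only those two entries of D are written.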

The Mill architecture, if I understood the lectures, only has a masked store instruction; other vector instructions calculate the values for all elements and maintain an "invalid result" bitvector. An exception is only triggered if an invalid result is actually stored.
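
A minimal C sketch of that deferred-fault idea, based only on the description above and not on any Mill documentation. The names `lanes_t`, `lanes_div` and `lanes_store_masked` are hypothetical, the lane count of 4 is arbitrary, and a `false` return value stands in for the exception.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define LANES 4  /* arbitrary lane count for the example */

typedef struct {
    int  val[LANES];
    bool invalid[LANES];  /* per-lane "invalid result" flag */
} lanes_t;

/* Divide every lane unconditionally. A lane that would fault (division by
 * zero or INT_MIN / -1) is only marked invalid; nothing traps here.
 * Invalid inputs propagate to the result. */
lanes_t lanes_div(const lanes_t *a, const lanes_t *b)
{
    lanes_t r;
    for (size_t i = 0; i < LANES; i++) {
        bool bad = b->val[i] == 0 ||
                   (a->val[i] == INT_MIN && b->val[i] == -1);
        r.invalid[i] = a->invalid[i] || b->invalid[i] || bad;
        r.val[i] = r.invalid[i] ? 0 : a->val[i] / b->val[i];
    }
    return r;
}

/* Masked store: the only place where an invalid lane matters. Returns false
 * (the stand-in for an exception) if a selected lane is invalid, in which
 * case nothing is written. */
bool lanes_store_masked(int *dst, const lanes_t *src, const bool *mask)
{
    for (size_t i = 0; i < LANES; i++) {
        if (mask[i] && src->invalid[i]) {
            return false;
        }
    }
    for (size_t i = 0; i < LANES; i++) {
        if (mask[i]) {
            dst[i] = src->val[i];
        }
    }
    return true;
}
```

A division by zero in a lane that the store mask excludes never becomes visible; the same lane under an all-true mask makes the store fail.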

mbitsnbites · on Aug 25, 2018

Thanks for the reference - I have not heard about the Convex C-Series before so I'll be sure to look it up.

While the MRISC32-A1 does serial vector processing, the idea is that you should be able to do parallel processing with the same ISA, so I'm not certain that masking in that fashion is a good option for the MRISC32. Also, having a vector mask register like the Cray limits the size of the vector registers (e.g. to 32 elements in the MRISC32, or 64 elements in a 64-bit architecture).
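
To make the size limit concrete, here is a small C sketch that assumes the mask is held as one bit per element in a single 32-bit register-sized word. That layout is an assumption for illustration, and `pack_mask_less` is a hypothetical helper, not an MRISC32 or Cray instruction.

```c
#include <assert.h>
#include <stdint.h>

#define MASK_BITS 32u  /* one mask bit per element in a 32-bit word */

/* Build a bit mask where bit i is set when a[i] < limit.
 * The mask word cannot describe more than MASK_BITS elements, which is
 * where the cap on the vector length comes from. */
uint32_t pack_mask_less(const int32_t *a, int32_t limit, unsigned vl)
{
    assert(vl <= MASK_BITS);
    uint32_t mask = 0;
    for (unsigned i = 0; i < vl; i++) {
        if (a[i] < limit) {
            mask |= (uint32_t)1 << i;
        }
    }
    return mask;
}
```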

childintime · on Aug 25, 2018

The Mill is far ahead of everybody else, and even vectorizes regular for loops. It's a most beautiful architecture; what it needs is an implementation, and for the team to get out of their bubble. Intel and the Mill are destined for each other, but they don't seem to know how to join forces.

g0xA52A2A · on Aug 26, 2018

> but they don't seem to know how to join forces.

I don't understand what you mean by this? The Mill team seems intent on producing a chip of their own.

childintime · on Aug 27, 2018

Indeed. Intel sees the Mill as insignificant, and itself as the gorilla in the room. But in terms of the architecture, the Mill is on the order of 10x better. That's incredible. Since Intel tends to rely on strong-arming, it is the Mill's #1 enemy for as long as the Mill has to go it alone. But it definitely should not be so.

g0xA52A2A · on Aug 24, 2018

Fun fact about the Cray-1: originally it didn't obey the commutative law. It was "fixed" by ensuring the inputs were sorted.

lisk1 · on Aug 25, 2018

Excellent project; I hope it will keep developing in the future. We need projects like these to detach from the grip of certain vendors. Another interesting project, based on expired patents on the SuperH CPU, is j-core.org; they also have a nice and informative video presentation on YouTube.

_iiu1 · on Aug 25, 2018

This is very cool! Any idea what sort of gate count we are looking at when synthesizing the VHDL? I'm not familiar with the Intel FPGA you refer to in the docs for the VHDL implementation:

https://github.com/mbitsnbites/mrisc32/tree/master/mrisc32-a...

mbitsnbites · on Aug 25, 2018

I'm still learning FPGAs myself, so I'm not sure if I can give you a relevant figure. Currently the A1 pipeline (no memory subsystem) synthesises to 2000-3000 ALMs (or about 15% of the logic in the device I'm targeting). That will increase as I implement more instructions (the FPU in particular), but on the other hand it will probably decrease when I switch to using BRAM for the scalar register file.

mbitsnbites · on Aug 26, 2018

Update: I just moved the scalar register file to block RAM, and now the pipeline uses ~1800 ALMs, or 9% of the chip logic.