A Single-Cycle 64-Bit RISC-V Register File (danielmangum.com)
84 points by hasheddan on Aug 4, 2023 | 33 comments


> All RISC-V architectures use 32 general purpose registers (GPRs)

Strictly speaking this is incorrect. There are variants called rv32e/rv64e which have 16 general purpose registers. These are particularly nice as a VM ISA since the reduced number of registers allows you to JIT the code to AMD64 (which itself only has 16 GPRs) without having to spill the registers.


There are some MCUs that use RV32E also - the CH32V003 [0] comes to mind.

[0] - https://github.com/openwch/ch32v003


In practice you will need some extra state, for example the RISC-V program counter and the x86 stack, so it's not really possible to map registers 1:1. Also some instructions on x86 need extra registers, or they can only refer to fixed registers (e.g. EAX and EDX for multiply-high).


> so it's not really possible to map registers 1:1

Oh it's definitely possible. I know this because I'm writing such a VM right now. (:

But I should have been a little more precise, sorry. By "without having to spill the registers" I didn't mean that you don't have to ever temporarily push anything to the stack, just that you can keep all of the RISC-V registers in native AMD64 registers all the time instead of having to keep some of them in memory (which you'd have to do with a full fat 32-register RISC-V ISA).

In general with rv32e/rv64e you need at most 15 registers, because one of the 16 is the zero register. This allows you to assign every RISC-V register to an AMD64 register and keep rsp for the native stack. And if you only want to support bare metal userspace RISC-V code with the normal ABI, then it gets even easier, because you only need 13 registers (gp and tp will be unused), so you'll have two extra temporary AMD64 registers to use.

> Also some instructions on x86 need extra registers, or they can only refer to fixed registers (e.g. EAX and EDX for multiply-high).

Yes. For example, the variable shift instructions hardcode the shift amount in rcx (more precisely, in cl). For the 15-register RISC-V you'll push rcx to the stack temporarily and pop it after the shift; for the 13-register RISC-V you can just designate rcx as one of your temporary registers, so you only need an extra mov here. Or you can use the shift instructions from BMI2 (shlx/shrx/sarx), which take the shift amount in any register. There are ways to handle these.


Author here -- that is a great call out, thanks for the correction!


Sadly, the article explains nothing. For example, it doesn't explain how the Verilog (or VHDL) code will be implemented at the gate level, whether it is expensive in terms of gate count, how long the critical path will be, whether it can be pipelined, how to implement multiple write ports, etc.

I think that one shouldn't write Verilog/VHDL code unless one clearly understands how it will be transformed into gates.


Author here -- thanks for the feedback! This is a quick post that is focused on how to logically think about the circuit, but I agree that all of the attributes you enumerated are valuable information as well. I plan to continue diving deeper in future posts. I am currently posting every Friday as part of my goal of gaining a deep understanding of chip design[0]. Please feel free to continue to provide feedback on future posts as well!

[0]: https://danielmangum.com/posts/a-three-year-bet-on-chip-desi...


I think the article is fine but what you're really driving towards is the idea of synchronous design, and how that simplifies design and analysis. Some of the objections here might go away if it were framed that way.


That's a great point. Thanks for the feedback!


I liked the post. As a novice with FPGAs and HDLs, I find posts like this very approachable. Refreshing tbh.


Perhaps showing the synthesized gates, and how, say, a structural version of the code would synthesize, would explain things in more detail.


Maybe you are not the target audience? I found it a decently informative read myself.

Why is it necessary to understand exactly how Verilog is transformed into gates? Sounds like gatekeeping to me, but please enlighten me. I have written HDL in the past and I didn't even have an FPGA.


While it is not strictly necessary in all scenarios, I think you will just not be able to get efficient circuits if you do not have a sense of what type of logic your Verilog code snippets map to.

I honestly believe that if you do not understand at least some basics, things will be way more painful and opaque than necessary, so it is a no-brainer IMO.

You also do not need to have an FPGA to learn this stuff.


Because Verilog is not C, where a line of code typically turns into several machine instructions. And even if you write inefficient code, CPUs are so fast that you won't notice it. In Verilog a simple "+" or "*" can turn into thousands of gates. Without understanding this your designs will be inefficient and slow. Unlike CPU cycles, silicon and transistors are not free.

Also for a design to be fast you need to use pipelining and to slice your combinational logic into thin layers. Again, you won't be able to do it without understanding how your code translates to gates.
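
To make "slicing into thin layers" concrete, here's a minimal sketch (my own illustration, nothing from the article) of a multiply-accumulate split into two pipeline stages, so the multiplier and the adder never sit in the same combinational path:

    // Hypothetical two-stage pipeline: the "*" alone can expand into
    // thousands of gates, so it gets its own stage, and the "+" runs
    // the following cycle from registered intermediates.
    module mac_pipelined (
        input             clk,
        input      [63:0] a,
        input      [63:0] b,
        input      [63:0] c,
        output reg [63:0] result
    );
        reg [63:0] product_q;  // stage 1 register: a * b
        reg [63:0] c_q;        // c delayed one cycle to stay aligned

        always @(posedge clk) begin
            // Stage 1: multiplier only
            product_q <= a * b;
            c_q       <= c;
            // Stage 2: adder only, fed from the stage 1 registers
            result    <= product_q + c_q;
        end
    endmodule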


> And even if you write inefficient code, CPUs are so fast that you won't notice it.

I think you are arguing for the right cause but with the wrong arguments. In C, not understanding in detail what exactly the code does can be just as disastrous as in Verilog. Internet examples abound.


> In Verilog a simple "+" or "*" can turn into thousands of gates. Without understanding this your designs will be inefficient and slow.

Trying hard to avoid this because it is inefficient is (imho) a typical case of premature optimization. Who cares whether you're wasting CPU cycles (in C) or FPGA logic (Verilog & co) when you're just learning what's what?

But there's a better reason: to make sure that whoever is learning Verilog (or other HDL) actually understands what they are trying to do.

With that understanding in place, the learning becomes "what logic construct(s) would be suitable" followed by "what Verilog to write that describes that logic".

Versus: "Verilog intro course has this example, does it compile? And does it appear to do what I think the Verilog says it should do?".

In programmer's parlance: semantics vs. syntax. The algorithm + how it maps to a CPU's resources, vs. red tape required to implement it.

If you 'feel' the syntax but the underlying semantics are black magic, then you could keep stumbling in the dark forever, spitting out copypasta along the way.

But if you understand basic concepts of the underlying hw (and how you intend to use that), learning syntax to describe that is straightforward.


It goes beyond being inefficient. It's the wrong mental model.

Chip designers intuitively visualize interconnected blocks of logic. They draw rectangles with wires between them.

To learn this intuition, writing spaghetti Verilog code that runs in a simulator is counter-productive. It's a common mistake that people coming from a software background make all the time, and that has been discussed to death. So with some appeal to authority, trust us and learn proper hardware design, hopefully it will be more rewarding and will "click" faster.


I don't understand where this sentiment is coming from. I have seen so many posts on HN that try to explain the "gotcha" difference between hardware and software, and I don't understand who they are targeting. For me it is obvious that VHDL maps to hardware. When I got to actually writing VHDL I was like "where is the gotcha? These people lied to me. Hardware works exactly like hardware!".

It feels condescending to read on the internet how hardware engineers claim that software engineers don't know how hardware works. Any kid who has played around with Redstone intuitively knows the difference, and in fact children are more likely to have extensive experience with hardware than with software, because most games have some sort of logic gate system, while only a tiny minority lets you program unless the entire game is based around programming.


You don't normally write register files in an HDL anyway (at least in the hard and soft flows I've dealt with). You don't want to build them out of regular logic so you end up wrapping some hard IP blocks. Block RAM on FPGAs, or the output of register file hard block generators on ASIC flows. The HDL just ends up being an interface boundary that gets swapped out for simulation.


I think in an FPGA flow you could rely on inferring a block RAM, no?


On FPGAs, a register file probably fits better into distributed RAM than block RAM.

On Xilinx, for example, a 64-bit register file doesn't map efficiently to the RAMB36 primitives. You'd need 2 RAMB36 primitives to provide a 64-bit wide memory with 1 write port and 2 read ports, each addressed separately. Only 6% (32 of 512) of the entries in each RAMB36 are ever addressable. It's this inefficient because ports, not memory cells, are the contended resource, and BRAM geometries aren't that elastic.

A 64-bit register file in distributed RAM, conversely, is something like an array of DPRAM32 primitives (see, for example, UG474). Each register would still be stored multiple times to provide additional ports, but depending on the fabric, there are fewer (or no) unaddressed storage cells.
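
To sketch what that duplication looks like in RTL (my own illustration, not code from the article or UG474): each read port gets its own copy of the storage, writes go to both copies, and the asynchronous reads typically map to LUT RAM rather than BRAM on Xilinx parts.

    // Illustrative 2-read/1-write register file aimed at distributed RAM.
    // The x0-always-reads-zero behavior is omitted for brevity.
    module regfile_2r1w (
        input         clk,
        input         we,
        input  [4:0]  waddr,
        input  [63:0] wdata,
        input  [4:0]  raddr1,
        input  [4:0]  raddr2,
        output [63:0] rdata1,
        output [63:0] rdata2
    );
        reg [63:0] bank1 [0:31];
        reg [63:0] bank2 [0:31];

        always @(posedge clk) begin
            if (we) begin
                bank1[waddr] <= wdata;  // both copies are written together
                bank2[waddr] <= wdata;
            end
        end

        assign rdata1 = bank1[raddr1];  // each copy serves one read port
        assign rdata2 = bank2[raddr2];
    endmodule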

The Minimax RISC-V CPU (https://github.com/gsmecher/minimax; advertisement warning: my project) is what you get if you chase efficient mapping of FPGA memory primitives (both register-file and RAM) to a logical conclusion. Whether this is actually worth hyper-optimizing really depends on the application. Usually, it's not.


It can infer lots of stuff, yes. But sometimes you have to write that part in a somewhat specific way to get synthesis to infer it, or you might want the extra control options that the inputs & outputs of a hard block instance offer, or you might have a more elaborate interconnection between several hard blocks and it ends up being easier to just instantiate these hard blocks and set them up manually.


In something as simple as the code in the article, yes. It's likely that the tool will infer a block RAM as long as there are BRAM resources with the required number of ports, etc.

It gets a little more unreliable when you start accessing it in more complex ways though.

From my understanding (I'm no FPGA expert), the code in the article will infer a BRAM with two read ports and one write port. That may be fine.

I actually battled with this recently on a project. I found that the tool was not inferring a block RAM when I expected it to, so I had to modify the Verilog to gate the reads and writes so that only one could happen at a time. That wasn't an issue in my case though.

My takeaway from the exercise was that it's sort of the equivalent of relying on a compiler's optimizer to recognize a programming pattern and do the right thing. After talking to one of the FPGA guys I work with, he seemed to feel that it's better to just instantiate a vendor IP BRAM directly. The downside is portability, though.
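
For what it's worth, the style that synthesis tools most reliably recognize as block RAM is a clocked (registered) read rather than an asynchronous assign read. A minimal single-port, read-first sketch, with names of my own choosing:

    // Read-first block RAM style: both the write and the read are clocked,
    // and a read of an address written in the same cycle returns the old data.
    module bram_style (
        input             clk,
        input             we,
        input      [8:0]  addr,
        input      [63:0] din,
        output reg [63:0] dout
    );
        reg [63:0] mem [0:511];

        always @(posedge clk) begin
            if (we)
                mem[addr] <= din;
            dout <= mem[addr];  // registered read is the key difference
        end
    endmodule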


I've found that the synthesizer can infer a lot of the time, but I wouldn't say you can rely on it.


It seems you can get a diagram of the actual circuit like this:

1. Go to https://edaplayground.com

2. Paste the code in the right pane labeled "design.sv"; it's probably a good idea to reduce the number of registers to 2 to make the output more readable

3. On the left, select "Yosys" in tools and simulators (it seems to be the only one that works without additional configuration or fiddling) and enable "Show diagram after run"

4. Click "Run" and it should open an image with a graph rendering showing the circuit. If it doesn't work, try to disable adblockers or popup blockers (it opens the output as a pop-up apparently)

You can also presumably run yosys locally.

In this case, it implements every register as a D flip-flop; the rs outputs are implemented with muxes from the flip-flops, and each flip-flop's D input is set either to the current value of the register or to data, depending on whether write_ctrl is set and rd is equal to the register number.
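
For reference, RTL in that flip-flop-plus-mux style, reconstructed from the description above and the snippets quoted elsewhere in the thread (so a guess, not the article's actual code), looks roughly like:

    // Reconstruction of a flip-flop based register file with asynchronous
    // read muxes: a D flip-flop array, writes gated by write_ctrl/rd, and
    // rv1/rv2 driven by muxes selected with rs1/rs2.
    module regfile (
        input             clk,
        input             write_ctrl,
        input      [4:0]  rd,
        input      [63:0] data,
        input      [4:0]  rs1,
        input      [4:0]  rs2,
        output     [63:0] rv1,
        output     [63:0] rv2
    );
        reg [63:0] registers [0:31];

        always @(posedge clk) begin
            if (write_ctrl)
                registers[rd] <= data;  // every other register keeps its value
        end

        assign rv1 = registers[rs1];    // read muxes from the flip-flops
        assign rv2 = registers[rs2];
    endmodule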


Correct. The code shown is only good enough for RTL simulation. It won't help actually generate a chip with multiple read and write ports to a register file.


On an FPGA, it will infer a dual-ported BRAM just fine. On an ASIC, it will generate a flip-flop array, which is good enough for many CPUs.

E.g. in this physical layout of a Swerv RISC-V CPU, the ARF section is the register file, built out of flip-flops: https://tomverbeure.github.io/2019/03/13/SweRV.html#swerv-ph....


Also, the read ports are asynchronous, which is bad:

    assign rv1 = registers[rs1];
    assign rv2 = registers[rs2];


That code is synthesizable and will work fine in an ASIC flow.


It's Verilog in this article.


In many RAMs, if you read and write the same address on the same clock edge, the read returns the old RAM contents, not the new one. So you have to bypass the RAM: you add a mux and a register to use the new data instead of the old if the addresses happen to be the same. In FPGAs, this bypass logic is sometimes generated automatically, depending on the coding style you used for the inferred RAM.

This is in addition to bypassing the ALU stage.

So the computer architecture question is whether or not massive amounts of bypassing (with resulting slower fMAX) is worth it, vs. stalling. You need a representative test suite to answer this.
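
A sketch of that bypass written by hand around an inferred read-first RAM (signal names are mine, not from any particular design):

    // Read-during-write bypass: if this cycle's write hits the address being
    // read, forward the new data instead of the stale RAM output.
    module ram_with_bypass (
        input             clk,
        input             we,
        input      [4:0]  waddr,
        input      [63:0] wdata,
        input      [4:0]  raddr,
        output     [63:0] rdata
    );
        reg [63:0] mem [0:31];
        reg [63:0] ram_q;      // registered read of the old contents
        reg [63:0] wdata_q;    // data written this cycle
        reg        collide_q;  // write and read addresses matched?

        always @(posedge clk) begin
            if (we)
                mem[waddr] <= wdata;
            ram_q     <= mem[raddr];
            wdata_q   <= wdata;
            collide_q <= we && (waddr == raddr);
        end

        // The extra mux: freshly written data wins when the addresses collided.
        assign rdata = collide_q ? wdata_q : ram_q;
    endmodule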


This isn't RAM, and you want to read the old value anyway.


Hardware implementations live in a multidimensional design space: performance (speed), power, and area. These should be taken into account in any work intended for more than just digital design basics.

This paper provides a modern take on these issues:

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-...



