BOOM Open Source RISC-V Core Runs on Amazon EC2 F1 Instances (cnx-software.com)
122 points by ingve on Dec 13, 2018 | 52 comments



Presentation slides (pdf) from Hot Chips 2018 (tape out): http://www.hotchips.org/hc30/1conf/1.03_Berkeley_BROOM_HC30....


And in video form, if that's your thing: https://www.youtube.com/watch?v=_mZORF2vHAA&feature=youtu.be....


And this video with even more information, including future directions.

https://youtu.be/sI6Z21ljXsw


For a more visual demo about what this article is talking about, here's a (real-speed) GIF of posting a tweet from BOOM running on EC2 F1:

https://twitter.com/firesimproject/status/103126763730350899...


What FPGA would you need to run this CPU otherwise? I would think it would be a lot cheaper to have one locally if you were still developing or testing out your application or code.


Unless you cut out floating point and superscalar, it requires a fairly large FPGA, on the order of >$1000. So you'd have to put it to serious use to break even with using AWS.

The other advantage of AWS is scale-out once you want to get benchmark data and not wait forever (or to simulate warehouse scale computers). In the past, maintaining our own FPGA cluster was a nightmare.


I haven't tried BOOM specifically, but if you're interested in a large FPGA that can be acquired very cheaply, you should look into the Pano Logic G2 thin client. They used to be available in bulk for $3 a piece, but those have now all been snapped up by hobbyists.

But you can still get them for $20 or so on eBay.

They have a Spartan-6 LX150 (DVI version, revB) or Spartan-6 LX100 (DVI version, revC), which are really huge FPGAs. It's by far the best deal in town if you're interested in that kind of stuff.

Google for "github panologic-g2" to find up-to-date progress on the reverse engineering effort.

Right now, the FPGA is up and running and outputting an image over DVI.

Do you have any idea about how many FFs are needed for a BOOM CPU?


I don't know anything about FPGAs. Are all FPGAs compatible? Assuming this fits on an FPGA, does it work on every FPGA?


No, the IO plumbing and infrastructure are different. FPGAs are a total pain because each platform is different. :'(


I don't remember exact numbers (and FPGA resources aren't always apples-to-apples), but it's bigger than a Zedboard and about 60% of a ZC706, I believe. The big issue, I suspect, is that it's a ton of wires.


Ok. The FPGA in a ZC706 is about double the size of the Spartan-6 LX150.

That's a lot of logic!


For this specific RISC-V variant, it seems to require relatively expensive FPGAs. There are some other RISC-V cores you can play with that run on small, cheap FPGAs. I've played with the picorv/picosoc project on a TinyFPGA BX; those are easy to find (recently sold out at some shops, but crowdsupply still has some if you can't wait a month for the next batch of boards) and cheap.

There are also a lot of new FPGA dev boards with good community support for folks looking at getting their feet wet, like the icebreaker (crowdsupply) and alchitry (Kickstarter).


Are there any good visual examples of how an out-of-order pipeline processes instructions compared to traditional ones? What's the main advantage?


Most (all?) modern processors utilize out-of-order execution. The main advantage is that it increases IPC (instructions per cycle) in the long run.

The high-level idea is that you queue up your instructions in a table of sorts and execute them as they're ready (i.e., based on dependencies). You then re-order the instructions at the end of the pipeline.

The main challenges with this technique are: (1) handling exceptions and (2) speculative execution. For (1), you can use a re-order buffer. For (2), you would typically flush the instruction table noted above.
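
To make that concrete, here's a made-up C fragment (the names are illustrative, nothing from the article or from BOOM itself):

    /* Toy illustration: on an out-of-order core, the two middle
       statements don't depend on the load, so they can issue and
       complete while the load is still waiting on a cache miss; the
       reorder buffer then retires all of the results in program order.
       An in-order core would simply stall behind the load. */
    long demo(const long *p, long a, long b, long c, long d) {
        long x = *p;        /* load: hundreds of cycles on a cache miss */
        long y = a * b;     /* independent of x: can execute "early"    */
        long z = c + d;     /* independent of x: can execute "early"    */
        return x + y + z;   /* depends on the load: has to wait for x   */
    }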


Let me add a couple thoughts:

1. You typically won't get very far without also implementing register renaming, such that the reorder buffer is actually much larger than the register file.

2. You need to distinguish between instructions that are "complete" (result is available for internal bypass to another instruction that is being speculatively executed) and instructions that are "retired" (all instructions before it have retired, the instruction can no longer raise an exception, and you know for certain that it was not executed because of incorrect speculation). Canonical, architecturally visible state is only written on retirement (see the sketch after this list).

3. Without good branch prediction, it is kind of pointless. A reasonable rule-of-thumb, Kentucky-windage estimate is that 20-25% of the instructions that you encounter in execution order (not static code analysis) are branches. So if you have 30 or more instructions active in the pipe, your branch prediction needs to be pretty good or you waste too much work. (Note that some branches are just "flaky" (hard to predict), but the compiler can usually do a decent job of identifying those ahead of time. Thus, modern processors include a conditional-move instruction, which allows the compiler to manage speculation and replace the flaky branch with an instruction that does not pollute the branch target buffer.)

4. Watching what happens in the simulator at the end of a pointer chasing loop is good humor. A strongly predicted branch gets mispredicted on: while(p!=NULL) {}, and all sorts of page-fault, memory-exception, arithmetic-exception logic lights up all over the place until something asserts "OOPS_NEVER_MIND", at which point all that fun stuff collapses in a smoldering heap. Watching that never gets old.
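
To make the complete/retired distinction in point 2 a bit more concrete, here's a hypothetical C sketch of what a reorder-buffer entry might track. The field names are invented for illustration; this is not BOOM's actual data structure.

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { WAITING, ISSUED, COMPLETE } rob_status_t;

    typedef struct {
        uint64_t     pc;         /* address of the instruction              */
        int          dest_preg;  /* renamed (physical) destination register */
        rob_status_t status;     /* COMPLETE: result available for bypass   */
        bool         exception;  /* recorded now, but raised only at retire */
    } rob_entry_t;

    /* Retirement: only the oldest entry may retire, and only once it is
       COMPLETE and known not to be on a mispredicted path. Architectural
       state (and any pending exception) is committed here and nowhere
       else. */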


> The main challenges with this technique are: (1) handling exceptions

RISC-V has an advantage here: other than load/store and instruction fetch/decode, the instruction set is designed to not cause exceptions. For instance, integer division by zero is defined to return a specific value, instead of being an error.
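
For example, here's roughly what the spec's defined results for signed division look like, paraphrased as a C sketch (the function name is mine):

    #include <stdint.h>

    /* Sketch of RV64 DIV semantics as defined in the M extension:
       integer division never traps. Divide-by-zero returns all bits
       set (-1), and the INT64_MIN / -1 overflow case returns the
       dividend. */
    int64_t rv64_div(int64_t rs1, int64_t rs2) {
        if (rs2 == 0)
            return -1;                      /* quotient: all ones     */
        if (rs1 == INT64_MIN && rs2 == -1)
            return rs1;                     /* signed overflow case   */
        return rs1 / rs2;
    }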


Also, RISC-V doesn't have a flags register. Instead it has fused cmp+jmp (compare-and-branch) instructions.


This is interesting to hear, and surprising. What are the provisions for efficient handling of integer overflows & underflows? I took a quick look at the ISA spec and am trying to track down how this is handled, as a curiosity.


AFAIU, none. It's designed for C, essentially, where signed integer overflows are undefined. In practice, the easiest and fastest design is to have it wrap around, just like unsigned.

I'm not sure I agree with that particular design decision, considering newer languages such as Rust that want to detect overflows. Then again, if/when such languages become more important, I guess RISC-V could add an extension featuring some kind of trapping overflow (or whatever mechanism is deemed best).


The RISC-V spec does claim that "many overflow checks can be cheaply implemented using branches" and then provides recommended instruction sequences. One could definitely imagine a RV implementation where e.g. macro-op fusion is used to systematically turn some of these recommended sequences into hardware-assisted branches.

(Ref: search for the "Integer Computational Instructions" section in the RISC-V "user mode" reference. It should be section 2.4)
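
For unsigned addition the recommended pattern boils down to an add followed by a single branch on the result. Here is the same check as a C sketch (the assembly in the comment shows the rough two-instruction shape; register names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* Unsigned add with overflow check, in the style the spec suggests:
       do the add, then branch if the result is smaller than an operand.
       Roughly:  add t0, a0, a1 ; bltu t0, a0, overflow */
    bool checked_uadd(uint64_t a, uint64_t b, uint64_t *out) {
        uint64_t sum = a + b;
        if (sum < a)          /* wrapped around => carry out => overflow */
            return false;
        *out = sum;
        return true;
    }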


Not mentioned there: if you want, you can replace the BLTU (for carry) with an SLTU to set a register to 0 if there is no carry and 1 if there is, effectively getting you a condition code register in a normal register, at the cost of one instruction (or three for signed overflow).
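
In C terms that variant looks something like this (a rough sketch; the lowering shown in the comment is approximate):

    #include <stdint.h>

    /* Capture the carry in a register instead of branching on it.
       Roughly:  add t0, a0, a1 ; sltu t1, t0, a0
       t1 is then 1 if the add carried out, 0 otherwise. */
    uint64_t add_with_carry_flag(uint64_t a, uint64_t b, uint64_t *carry) {
        uint64_t sum = a + b;
        *carry = (sum < a);   /* the SLTU comparison */
        return sum;
    }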

If you're going to be doing a lot of signed overflow checking (e.g. JavaScript) it may be better to store your numbers as N+0x80000000 (for 32-bit). Then each addition needs to subtract the offset afterwards, and each subtraction needs to add it (which works out to the same thing). The overflow check is then a single branch after the add and the adjust, for a total of three instructions instead of a total of four.

This seems bad, but if you're loading the values from memory and putting them back afterwards then it's pretty minor.
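
Here's a sketch of that biased-representation trick for 32-bit values kept in 64-bit registers. The constant and function names are mine, and in real code the bias and the 2^32 limit would already be sitting in registers:

    #include <stdbool.h>
    #include <stdint.h>

    #define BIAS 0x80000000ull    /* signed x is stored as x + BIAS */

    /* Add two biased 32-bit values. After re-subtracting the bias, the
       result is in range iff it still fits in [0, 2^32), which is a
       single unsigned comparison: add, adjust, branch. */
    bool biased_add(uint64_t xb, uint64_t yb, uint64_t *zb) {
        uint64_t z = xb + yb - BIAS;     /* biased representation of x+y */
        if (z >= (1ull << 32))           /* out of int32 range: overflow */
            return false;
        *zb = z;
        return true;
    }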

A later extension (such as J) could add instructions such as BADC/BADV/BSUC/BSUV a,b,label -- "Branch if ADd generates Carry" etc. These would fit into the existing instruction encoding and execution pipelines no problem.

Or, if this is a rare thing (as it should be in JavaScript) then making them "Trap if ADd generates oVerflow" would take a tiny amount of opcode space and allow the trap handler (which can be delegated to User space) to perhaps transparently fix up the result.


This. Also, the J extension should add some more stuff in that direction.


I can appreciate the implementation benefits in not having a single mutating global state register.

One option I just thought of is simply to have multi-output variants of the instruction which would take an extra flags register as an output operand.

Anyway, that's really interesting to find out. It would complicate the optimized implementation of JS engines, which need overflow detection to promote int32 arithmetic to doubles where needed; that's what I was thinking of when I asked the question.


The ARM Cortex-A55 is modern and in-order. Out-of-order CPUs tend to be large so if you want a small CPU you'll still go with in-order.


Energy/power efficiency might be a more relevant concern than mere size - OOO execution is fast but energy intensive, for a given workload. (Granted, the latest process nodes might be a tad different, with "size" also impacting power consumption directly due to leakage currents. This stuff can get a bit complicated.)


Interesting. Is lower power OOE an area of active research?


Well, to the extent that improved process brings lower power, it's an area of stupendous research investment. If you just mean architecturally, there are people looking at ways to use less power by being less flexible while still getting most of the benefit, but it's not a huge topic.


The Mill processor architecture deserves mention here, as it basically aims to bridge VLIW-like low power usage with OOO-like performance. The way it attempts to do this is too complex to explain in a short comment, you can find detailed info on their website - but broadly-speaking it involves all sorts of 'tweaks' to ordinary VLIW, that ultimately result in something a bit different. (It should also be noted though that the initial, instruction decoding stages of a Mill-like 'belt' architecture are a lot closer to OOO than an ordinary VLIW. So the projected power savings would seemingly have to come from elsewhere - but they can be quite significant nonetheless, so the effort is projected to be worthwhile.)


I imagine (because there are not that many details on the Mill out there) that the power savings come mainly from:

- A post-compiler compilation step that adapts a binary for a specific chip, replacing a complex hardware control module by a one-time software run, so the only thing actually on the chip is the routing;

- A very nice set of primitives that push the non-determinism into data, instead of flow control.

- Cheap interrupts, cheap memory access, and every other detail rethought to be cheap.


Interesting, I remember hearing about this processor architecture. Is it still being actively developed? I didn't see much in the way of news on the Mill site.


There are a few recent forum posts indicating that they're still working on it at least, but it seems to be going extremely slowly.

5 years ago it had already been in development for 10 years and there was supposed to be silicon shipping in 2-3 years. 2 years ago they were talking about an FPGA prototype that has yet to show up.


Note that even an in-order processor that is pipelined and has a branch predictor (so it doesn't stall the pipeline on every conditional branch) does "speculative execution": the predictor can guess wrong, and then the pipeline has to be flushed.

https://en.wikipedia.org/wiki/Branch_predictor#History


Sort of. Later instructions are fetched and decoded speculatively, but because the processor is in-order they don't get to the execute stage (let alone write-back) before the misprediction is discovered. All that has to be flushed is the fetch and decode, which doesn't change any visible state anyway. (hmm .. except icache contents .. the mispredicted branch might evict something else from the icache .. but the load from L2 or main memory will probably take longer than the pipeline length, so could be shot down before updating the icache)


The documentation on the BOOM processor mentioned in this article seems to be a good source of information: https://docs.boom-core.org/en/latest/sections/RobDispatch/ro.... The link goes specifically to one component, the reorder buffer; the rest of the docs discuss more of the components that are necessary to make it work.


Rather than stalling while waiting for long running operations (like accessing main memory), it executes unrelated operations while waiting for them to complete. Performance gains are dependent on the details of the design, but are pretty significant in general.


There is a great paper describing a very simple OoO RISC CPU, with nice diagrams and a walkthrough of a few execution cycles - https://ece.umd.edu/~blj/RiSC/RiSC-oo.1.pdf


Suppose you do a+b. If a or b is not yet available, an out-of-order CPU will run the (independent) instructions after the a+b instruction while it waits. Without that, the CPU would just sit waiting for a and b to become available, doing nothing.


https://youtu.be/WC5Bo9UdI2w

The main advantage is higher IPC -> better performance since there is less idle time spent waiting for slow stuff like reading from disk.


Less time spent waiting: yes.

Like reading from disk: no.

The maximum number of instructions in flight out of order is typically around 300 these days?

Reading from disk takes millions of cycles.

It's about avoiding waits when reading from L1 and L2 cache and DRAM.


Yup, reading from disk will trigger a context switch from the OS, whether it happens explicitly or via a page fault involving virtual memory. (The closest parallel to that at the hardware level would be SMT.) As an aside, another rationale - perhaps a more paradigmatic one - for OOO involves achieving full use of super-scalar execution units on a processor core, when instructions have complex dependencies on one another.


You can get a good overview from the chapters about the Pentium in Michael Abrash's "Graphics Programming Black Book".


I wish it, or a cut-down version, would run on a cheaper FPGA (like the Zynq or Zedboard, ~$250). Running on an EC2 F1 instance is not really an option for students, given that it costs about ~$10 per day.


It's a comparatively big, application-class CPU. You're asking too much. Those cheaper FPGAs are cheaper because they're smaller, and BOOM won't fit.


Not entirely unrealistic. Removing the FPU and reducing the width should get it to fit.


But then it would not be an application class processor.


Try j-Core instead: http://j-core.org

They can do non-MMU SMT which is kinda cool too.


lowRISC will run Linux on a $265 FPGA board.


Little thing, this OUT OF ORDER cpu was created IN ORDER to...


My parents thought it was quite odd that I named it the Berkeley Out of Order Machine. "Why would you advertise to the world that it doesn't work?"


BOOM really needs to get compressed instruction support. Until that's in, I would assume development has stalled, since it's required to run precompiled Debian and Fedora packages.


What? Because the feature you want is not there, development has stalled?

There are lots of useful things you can do without C. Also, it has not stalled:

https://youtu.be/sI6Z21ljXsw


My wording was a bit harsh. The git repository hasn't been very active, but it's not dead.

The C extension is mentioned in that video as a priority, but it sounds like some of the other priorities are being worked on (maybe there's funding for Spectre mitigation work). Anyone who wants to use BOOM outside of research is going to need the C option because that's the standard for OSes. They said it's nice not to have to maintain a fork of Rocket Chip; the same will be true of not needing to recompile everything without C.

It is more active than I thought.



