Box64 and RISC-V in 2024: What It Takes to Run the Witcher 3 on RISC-V (box86.org)
366 points by pabs3 3 months ago | 141 comments



Question for somebody who doesn't work in chips: what does a software engineer have to do differently when targeting software for RISC-V?

I would imagine that executable size increases, meaning it has to be aggressively optimized for cache locality?

I would imagine that some types of software are better suited for either CISC or RISC, like games, webservers?


RISC-V with the compressed instruction extension actually ends up smaller than x86-64 and ARM on average.

There's not much inherent that needs to change in software approach. Probably the biggest thing vs x86-64 is the availability of 32 registers (vs 16 on x86-64), allowing for more intermediate values before things start spilling to stack, which also applies to ARM (which too has 32 registers). But generally it doesn't matter unless you're micro-optimizing.

More micro-optimization things might include:

- The vector extension (aka V or RVV) isn't in the base rv64gc ISA, so you might not get SIMD optimizations depending on the target; whereas x86-64 and aarch64 have SSE2 and NEON (128-bit SIMD) in their base.

- Similarly, no popcount & count leading/trailing zeroes in base rv64gc (requires Zbb); base x86-64 doesn't have popcount, but does have clz/ctz. aarch64 has all.

- Less efficient branchless select, i.e. "a ? b : c"; takes ~4-5 instrs on base rv64gc, 3 with Zicond, but 1 on x86-64 and aarch64. Some hardware can also fuse a jump over a mv instruction to be effectively branchless, but that's even more target-specific.
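
For illustration, a minimal C sketch of the branchless-select idiom; the commented instruction sequences are roughly what a compiler might emit, not the only possibility:

    #include <stdint.h>

    // cond ? b : c without a branch (cond is 0 or 1).
    // Base rv64gc: roughly neg/and/not/and/or (~4-5 instructions).
    // With Zicond: czero.eqz + czero.nez + or (3 instructions).
    // x86-64: a single cmov; aarch64: a single csel.
    uint64_t select_branchless(uint64_t cond, uint64_t b, uint64_t c) {
        uint64_t mask = -(cond & 1);       // all-ones if cond, else all-zeros
        return (b & mask) | (c & ~mask);
    }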

RISC-V profiles kind of solve the first two issues (e.g. Android requires rva23, which requires rvv & Zbb & Zicond among other things) but if linux distros decide to target rva20/rv64gc then they're ~forever stuck without having those extensions in precompiled code that hasn't bothered with dynamic dispatch. Though this is a problem with x86-64 too (much less so with ARM as it doesn't have that many extensions; SVE is probably the biggest thing by far, and still not supported widely (i.e. Apple silicon doesn't)).
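
On the dynamic-dispatch point, a minimal sketch of one way to do it on riscv64 Linux (assumptions: the kernel exposes the single-letter V extension via AT_HWCAP, which recent kernels do; multi-letter extensions like Zbb/Zicond need the newer riscv_hwprobe interface instead; saxpy_rvv/saxpy_scalar are hypothetical functions compiled separately):

    #include <stdbool.h>
    #include <sys/auxv.h>

    // Single-letter extensions appear as bits ('x' - 'a') in AT_HWCAP on riscv64 Linux.
    static bool have_rvv(void) {
        return (getauxval(AT_HWCAP) & (1UL << ('v' - 'a'))) != 0;
    }

    void saxpy_scalar(float *y, const float *x, float a, int n); // rva20/rv64gc fallback
    void saxpy_rvv(float *y, const float *x, float a, int n);    // built with the V extension enabled

    void saxpy(float *y, const float *x, float a, int n) {
        if (have_rvv())
            saxpy_rvv(y, x, a, n);      // vector path only when the CPU has it
        else
            saxpy_scalar(y, x, a, n);
    }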


That seems like something the compiler would generally handle, no? Obviously that doesn't apply everywhere, but in the general case it should.


Vector stuff is typically hand coded with intrinsics or assembly. Autovectorization has mixed results because there’s no way to request the compiler to promise that it vectorized the code.

But for an emulator like this, box64 has to pick how to emulate vectorized instructions on RISC-V (e.g. slowly using scalars, or trying to reimplement them using native vector instructions). The challenge, of course, is that a 1:1 mapping is going to be suboptimal: you typically don't get as good performance unless the emulator can actually rewrite the code on the fly, noticing patterns of high-level operations being performed and replacing a whole chunk of instructions at once with a more optimized implementation that accounts for implementation differences on the chip (e.g. you may have to emulate missing instructions, but a rewriter could skip emulation if there's an alternate way to accomplish the same high-level computation).

The biggest challenge for something like this from a performance perspective, of course, will be translating the GPU stuff efficiently to hit the native driver code, given that RISC-V is likely relying on OSS GPU drivers (and maybe Wine adds another translation layer if the game is Windows-only).


I'd assume it uses RADV, same as the Steam Deck. For most workloads that's faster than AMD's own driver. And yes, it uses Wine and DXVK. As far as the game is concerned it's running on a DirectX-capable x86 Windows machine. That's a lot of translation layers.


On clang, you can actually request that it gives a warning on missed vectorization of a given loop with "#pragma clang loop vectorize(enable)": https://godbolt.org/z/sP7drPqMT (and you can even make it an error).

There's even "#pragma clang loop vectorize(assume_safety)" to tell it that pointer aliasing won't be an issue (gcc has a similar "#pragma GCC ivdep"), which should get rid of most odd reasons for missed vectorization.
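
A small example of both pragmas (the diagnostic is in clang's -Wpass-failed group as far as I remember, so -Werror=pass-failed turns a missed vectorization into an error):

    // With vectorize(enable), clang -O2 warns if it fails to vectorize this loop.
    void scale(float *dst, const float *src, float k, int n) {
    #pragma clang loop vectorize(enable)
        for (int i = 0; i < n; ++i)
            dst[i] = src[i] * k;
    }

    // assume_safety additionally tells clang the pointers don't alias, so it
    // doesn't need runtime overlap checks (similar to "#pragma GCC ivdep").
    void scale_noalias(float *dst, const float *src, float k, int n) {
    #pragma clang loop vectorize(assume_safety)
        for (int i = 0; i < n; ++i)
            dst[i] = src[i] * k;
    }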


> Vector stuff is typically hand coded with intrinsics or assembly. Autovectorization has mixed results because there’s no way to request the compiler to promise that it vectorized the code.

Right, but most of the time those are architecture specific, and RVV 1.0 is substantially different than, say, NEON or SSE2, so you need to change it anyways. You also typically use specialized registers for those, not the general purpose registers. I'm not saying there isn't work to be done (especially for an application like this one, which is extremely performance sensitive), I'm saying that most applications won't have these problems or be so sensitive that register spills matter much, if at all.


I'm highlighting that the compiler doesn't take care of vector code quite as automatically or as well as it does register allocation and instruction selection, which are slightly more solved problems. And it's easy to imagine a compiler failing to optimize a piece of code as well on something that's architecturally quite novel. RISC-V and ARM aren't actually so dissimilar at a high level that completely different optimizations need to be written and selectively weighted by architecture, but I imagine something like a Mill CPU might require quite a reimagining to get anything approaching optimal performance.


I read somewhere that since floating point addition is not associative the compiler will not autovectorize because the order might change.


It's somewhat more complicated than that (and it presumes your hot path is floating point instead of integral), but that can be a consideration.


What are the other considerations? (assuming we are dealing with FP)


Disclaimer: not an expert here so could be very very wrong. This is just my understanding so happy to be corrected.

Another would be that something like fused multiply-add has different (higher, if I recall correctly) precision, which violates IEEE 754 and thus blocks vectorization, since the default options are standard compliant.

Another is that some math intrinsics are documented to populate errno which would prevent using autovec in paths that have an intrinsic.

There may be other nuances depending on float vs double.

Basically, most of the things that -ffast-math relaxes are, I believe, things that would otherwise prevent autovectorization.


Fused multiply add applies equally to scalar and vectorized code (and C actually allows compilers to fuse them; there's -ffp-contract=off / the FP_CONTRACT pragma to turn that off); the compiler/autovectorizer can trivially just leave multiply & add as separate if so requested (slower than having them fused? perhaps. But no impact at all on scalar vs vector given that both have the same fma applicability).

For <math.h> errno, there's -fno-math-errno; indeed included in -ffast-math, but you don't need the entirety of that mess for this.

Loops with a float accumulator are, I believe, the only case where -ffast-math is actually required for autovectorizability (and even then, IIRC, there are some sub-flags such that you can get the associativity-assuming optimizations while still allowing NaN/inf).
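
To make that last case concrete, a sketch (the flags named in the trailing comment are the GCC/clang spellings as I remember them):

    // A float accumulator forces a strict left-to-right summation order, so the
    // compiler can't split it into vector partial sums unless it may reassociate.
    float sum(const float *x, int n) {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i)
            acc += x[i];      // ((0 + x[0]) + x[1]) + ... exactly, per IEEE 754
        return acc;
    }
    // Vectorizes with -ffast-math, or more narrowly with something like
    // -fassociative-math -fno-signed-zeros -fno-trapping-math, which permits
    // reassociation while keeping NaN/inf handling intact.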


It's something that the compiler would handle, but can still moderately influence programming decisions, i.e. you can have a lot more temporary variables before things start slowing down due to spill stores/loads (esp. in, say, a loop with function calls, as more registers also means more non-volatile registers (i.e. those that are guaranteed to not change across function calls)). But, yes, very limited impact even then.


It's certainly something I would take into consideration when making a (language) runtime, but probably not at all for all but the most performance-sensitive of applications. Certainly a difference, but far lower level than what most applications require.


Yep. Unfortunately I am one to be making language runtimes :)

It's just the potentially most significant thing I could come up with at first. Though perhaps RVV not being in rva20/rv64gc is more significant.


Looks like an APL project? That's really cool!


> Question for somebody who doesn't work in chips: what does a software engineer have to do differently when targeting software for RISC-V?

Most of the time, nothing; code correctly written on higher-level languages like C should work the same. The biggest difference, the weaker memory model, is something you also have on most non-x86 architectures like ARM (and your code shouldn't be depending on having a strong memory model in the first place).
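
For instance, cross-thread code that silently relies on x86's strong ordering needs explicit ordering to stay correct on RISC-V or ARM; a minimal C11 sketch of the portable way to write a producer/consumer handoff:

    #include <stdatomic.h>
    #include <stdbool.h>

    static int payload;          // plain data handed from one thread to another
    static atomic_bool ready;

    void producer(void) {
        payload = 42;
        // release: the payload write can't be reordered after the flag write
        atomic_store_explicit(&ready, true, memory_order_release);
    }

    int consumer(void) {
        // acquire pairs with the release above; without it, a weak memory model
        // may legally show the flag as set before the payload write is visible
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;
        return payload;
    }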

> I would imagine that executable size increases, meaning it has to be aggressively optimized for cache locality?

For historical reasons, executable code density on x86 is not that good, so the executable size won't increase as much as you'd expect; both RISC-V with its compressed instructions extension and 32-bit ARM with its Thumb extensions are fairly compact (there was an early RISC-V paper which did that code size comparison, if you want to find out more).

> I would imagine that some types of software are better suited for either CISC or RISC, like games, webservers?

What matters most is not CISC vs RISC, but the presence and quality of things like vector instructions and cryptography extensions. Some kinds of software like video encoding and decoding heavily depend on vector instructions to have good performance, and things like full disk encryption or hashing can be helped by specialized instructions to accelerate specific algorithms like AES and SHA256.


No, pretty much any ISA should be equally good for any type of workload. If you are doing assembly programming then it makes a difference, but if you're doing something in Python or Unity it really isn't going to matter.

This is more about being free of ARM’s patents and getting a fresh start using the lessons learned


Reminds me of how one famous Russian guy ran Atomic Heart on the Elbrus 8S.

Elbrus has a native translator, though, and a pretty good one, AFAIK. Atomic Heart was kinda playable, 15-25 fps.



Elbrus is/was RISC?-V?



Nah, it is fully custom VLIW


Article is a bit short on "the basics" - I assumed they used some kind of Wine port to run it. But it seems they implemented the x86_64 ISA on a RISC-V chip in some way - can anyone shed more light on how that is done?


The basics are here: https://box86.org/ It is an emulator but:

> Because box86 uses the native versions of some “system” libraries, like libc, libm, SDL, and OpenGL, it’s easy to integrate and use with most applications, and performance can be surprisingly high in some cases.

Wine can also be compiled/run as native.


> Wine can also be compiled/run as native.

I'm not sure you can run Wine natively to run x86 Windows programs on RISC-V because Wine is not an emulator. There is an ARM port of Wine, but that can only run Windows ARM programs, not x86.

Instead box64 is running the x86_64 Wine https://github.com/ptitSeb/box64/blob/main/docs/X64WINE.md


It should be theoretically possible to build Wine so that it provides the x86_64 API while compiling it to ARM/RISCV. Your link doesn't make it clear if that's what's being done or not.

(Although I suspect providing the API of one architecture while building for another is far easier said than done. Toolchains tend to be uncooperative about such shenanigans, for starters.)


Box64's documentation is just on installing the Wine x64 builds from winehq repos, because most arm repos aren't exactly hosting x64 software. It's even possible to run Steam with their x64 Proton running Windows games. At least on ARM, not sure about RISC-V.

Wine's own documentation says it requires an emulator: https://wiki.winehq.org/Emulation

> As Wine Is Not an Emulator, all those applications can't run on other architectures with Wine alone.

Or do you mean providing the x86_64 Windows API as a native RISC-V/ARM library to the emulator layer? That would require some deeper integration for the emulator, but that's what Box64/Box86 already does with some Linux libraries: intercept the API calls and replace them with native libraries. Not sure if it does that for Wine.


> but that's what Box64/Box86 already does with some Linux libraries: intercept the API calls and replace them with native libraries. Not sure if it does that for Wine

Yeah, that's what I meant. It's simple in principle, after all: turn an AMD64 call into an ARM/RISCV call and pass it to native code.

Doing that for Wine would be pretty tricky (way more surface area to cover, possible differences between certain Win32 arch-specific structs and so forth) so I bet that's not how it works out of the box, but I couldn't tell for sure by skimming through the box64 repo.


As demonstrated by Microsoft themselves in Windows 11: https://learn.microsoft.com/en-us/windows/arm/arm64ec


Incredible result! This is a tremendous amount of work and does seem like RV is at its limits in some of these cases. The bit gather and scatter instructions should become an extension!


Would be useful to see test results on a game that relies more heavily on the graphics core than the CPU. Perhaps Divinity 2?


> At least in the context of x86 emulation, among all 3 architectures we support, RISC-V is the least expressive one.

RISC was explained to me as a reduced instruction set computer in computer science history classes, but I see a lot of articles and proposed new RISC-V profiles about "we just need a few more instructions to get feature parity".

I understand that RISC-V is just a convenient alternative to other platforms for most people, but does this also mean the RISC dream is dead?


As I've heard it explained, RISC in practise is less about "an absolutely minimalist instruction set" and more about "don't add any assembly programmer conveniences or other such cleverness, rely on compilers instead of frontend silicon when possible".

Although as I recall from reading the RISC-V spec, RISC-V was rather particular about not adding "combo" instructions when common instruction sequences can be fused by the frontend.

My (far from expert) impression of RISC-V's shortcomings versus x86/ARM is more that the specs were written starting with the very basic embedded-chip stuff, and then over time more application-cpu extensions were added. (The base RV32I spec doesn't even include integer multiplication.) Unfortunately they took a long time to get around to finishing the bikeshedding on bit-twiddling and simd/vector extensions, which resulted in the current functionality gaps we're talking about.

So I don't think those gaps are due to RISC fundamentalism; there's no such thing.


Put another way, "try to avoid instructions that can't be executed in a single clock cycle, as those introduce silicon complexity".


But that's not even close to true, either, eg any division or memory operation.

In practice there's no such thing as "RISC" or "CISC" anymore really, they've all pretty much converged. At best you can say "RISC" now just means that there aren't any mixed load + alu instructions, but those aren't really used in x86 much, either


You've hit the nail on the head. Really, when people complain about CISC vs RISC, they are mostly complaining about two particular things. The first is that x86 processors carry legacy baggage (aka they have had a long history of success that continues to this day) and the second is that x86 has a lot of variable length instructions. After that, most of the complaints are very nit-picky, such as the number of general purpose registers and how they are named.


>and more about "don't add any assembly programmer conveniences or other such cleverness, rely on compilers instead of frontend silicon when possible"

What are the advantages of that?


It shifts implementation complexity from hardware onto software. It's not an inherent advantage, but an extra compiler pass is generally cheaper than increased silicon die area, for example.

On a slight tangent, from a security perspective, if your silicon is "too clever" in a way that introduces security bugs, you're screwed. On the other hand, software can be patched.


I honestly find the lack of compiler/interpreter complexity disheartening.

It often feels like as a community we don't have an interest in making better tools than those we started with.

Communicating with the compiler, and generating code with code, and getting information back from the compiler should all be standard things. In general they shouldn't be used, but if we also had better general access to profiling across our services, we could then have specialists within our teams break out the special tools and improve critical sections.

I understand that many of us work on projects with already absurd build times, but I feel that is a side effect of refusal to improve ci/cd/build tools in a similar way.

If you have ever worked on a modern TypeScript framework app, you'll understand what I mean. You can create decorators and macros talking to the TypeScript compiler and asking it to generate some extra JS or modify what it generates. And the whole framework sits there running partial re-builds and refreshing your browser for you.

It makes things like golang feel like they were made in the 80s.

Freaking golang... I get it, macros and decorators and generics are over-used. But I am making a library to standardize something across all 2,100 developers within my company... I need some meta-programming tools please.


I usually talk a lot about Oberon or Limbo; however, their designs were constrained by the hardware costs of the 1990s, and by how much more the alternatives asked for in resources.

We are three decades away from those days, with more than enough hardware to run those systems; back then, that kind of hardware was only available in universities or companies with very deep pockets.

Yet the Go culture hasn't updated itself, or only very reluctantly, with the usual mistakes that were already visible when 1.0 came out.

And since they hit gold with CNCF projects, it's pretty much unavoidable for some work.


complexity that the compiler removes doesn't have to be handled by the CPU at runtime


Sure but that's not necessarily at odds with "programmer conveniences or other such cleverness" is it?


It is, in the sense that those are programmer conveniences only for assembly programmers, and RISC-V's view is that, to the extent possible, the assembly programmer interface should largely be handled by pseudo-instructions that disappear when you go to machine code, rather than making the chip deal with them.


Instructions can be completed in one clock cycle, which removes a lot of complexity compared to instructions that require multiple clock cycles.

Removed complexity means you can fit more stuff into the same amount of silicon, and have it be quicker with less power.


That's not exactly it; quite a few RISC-style instructions require multiple (sometimes many) clock cycles to complete, such as mul/div, floating point math, and branching instructions can often take more than one clock cycle as well, and then once you throw in pipelining, caches, MMUs, atomics... "one clock cycle" doesn't really mean a lot. Especially since more advanced CPUs will ideally retire multiple instructions per clock.

Sure, addition and moving bits between registers takes one clock cycle, but those kinds of instructions take one clock cycle on CISC as well. And very tiny RISC microcontrollers can take more than one cycle for adds and shifts if you're really stingy with the silicon.

(Memory operations will of course take multiple cycles too, but that's not the CPU's fault.)


>quite a few RISC-style instructions require multiple (sometimes many) clock cycles to complete, such as mul/div, floating point math

Which seems like stuff you want support for, but this is seemingly arguing against?


It seems contradictory because the "one clock per instruction" is mostly a misconception, at least with respect to anything even remotely modern.

https://retrocomputing.stackexchange.com/a/14509


Got it, so it's more about removing microcode.


The biggest divide is that no more than a single exception can occur in a RISC instruction, but you can have an indefinite number of page faults in something like an x86 rep mov.


That's not even true, as you can get lots of exceptions for the same instruction. For example, a load can raise all of these and more (but only one will be reported at a time): instruction fetch page fault, load misaligned, and load page fault.

More characteristic are assumptions about side effects (none for integer, and cumulative flags for FP) and number of register file ports needed.


In order to have an instruction set that a student can implement in a single semester class you need to make simplifications like having all instructions take two inputs and one output. That also makes the lives of researchers experimenting with processor design a lot simpler as well. But it does mean that some convenient instructions for getting to higher performance are off the table.

That's not the whole story: a simpler pipeline takes fewer engineering resources for teams going for a high performance design, so they can spend more time optimizing.

RISC is generally a philosophy of simplification but you can take it further or less far. MIPS is almost as simplified as RISC-V but ARM and POWER are more moderate in their simplifications and seem to have no trouble going toe to toe with x86 in high performance arenas.

But remember there are many niches for processors out there besides running applications. Embedded, accelerators, etc. In the specific niche of application cores I'm a bit pessimistic about RISC-V but from a broader view I think it has a lot of potential and will probably come to dominate at least a few commercial niches as well as being a wonderful teaching and research tool.


The RISC dream was to simplify CPU design because most software was written using compilers and not direct assembly.

Characteristics of classical RISC:

- Most data manipulation instructions work only with registers.

- Memory instructions are generally load/store to registers only.

- That means you need lots of registers.

- Do your own stack because you have to manually manipulate it to pass parameters anyway. So no CALL/JSR instruction. Implement the stack yourself using some basic instructions that load/store to the instruction pointer register directly.

- Instruction encoding is predictable and each instruction is the same size.

- More than one RISC arch has a register that always reads 0 and can't be written. Used for setting things to 0.

This worked, but then the following made it less important:

- Out-of-order execution - generally the raw instruction stream is a declaration of a path to desired results, but isn't necessarily what the CPU is really doing. Things like speculative execution, branch prediction and register renaming are behind this.

- SIMD - basically a separate wide register space with instructions that work on all values within those wide registers.

So really OOO and SIMD took over.


Is there a RISC dream? I think there is an efficiency "dream", there is a performance "dream", there is a cost "dream" — there are even low-complexity relative to cost, performance and efficiency "dreams" — but a RISC dream? Who cares more about RISC than cost, performance, efficiency and simplicity?


There was such a dream. It was about getting a mind-bogglingly simple CPU, putting caches into the now empty space where all the control logic used to be, clocking it up the wazoo, and letting the software deal with load/branch delays, efficiently using all 64 registers, etc. That'll beat the hell out of those silly CISC architectures at performance, and at a fraction of the design and production costs!

This didn't work out, for two main reasons: first, just being able to turn clocks hella high is still not enough to get great performance: you really do want your CPU to be super-scalar, out-of-order, and with great branch predictor, if you need amazing performance. But when you do all that, the simplicity of RISC decoding stops mattering all that much, as Pentium II demonstrated when it equalled DEC Alpha on performance, while still having practically useful things like e.g. byte loads/stores. Yes, it's RISC-like instructions under the hood but that's an implementation detail, no reason to expose it to the user in the ISA, just as you don't have to expose the branch delay slots in your ISA because it's a bad idea to do so: e.g. MIPS II added 1 additional pipeline stage, and now they needed two branch/load delay slots. Whoops! So they added interlocks anyway (MIPS originally stood for "Microprocessor without Interlocked Pipelined Stages", ha-ha) and got rid of the load delays; they still left 1 branch delay slot exposed due to backwards compatibility, and the circuitry required was arguably silly.

The second reason was that the software (or compilers, to be more precise) can't really deal very well with all that stuff from the first paragraph. That's what sank Itanium. That's why nobody makes CPUs with register windows any more. And static instruction scheduling in the compilers still can't beat dynamic instruction reordering.


Great post, as it also directly invalidates the myth that the ARM instruction set somehow makes the whole CPU better than analogous x86 silicon. It might be true and responsible for like 0.1% (guesstimate) of the total advantage; it's all RISC under the hood anyway, and both ISAs need decoders. x86 might need a slightly bigger one, which amounts to accounting noise in terms of area.

c.f. https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...


> This didn't work out

... except it did.

You had literal students design chips that outperformed industry cores that took huge teams and huge investment.

Acorn had a team of just a few people build a core that outperformed an i460 with likely 1/100 of the investment. Not to mention the even more expensive VAX chips.

Can you imagine how fucking baffled the DEC engineers at the time were when their absurdly complex and absurdly expensive VAX chips were smoked by a bunch of first-time chip designers?

> as Pentium II demonstrated

That chip came out in 1997. The original RISC chip research happened in the early 80s or even earlier. It did work; it's just that x86 was bound to the PC market and Intel had the finances to have huge teams hammer away at the problem. x86 was able to overtake Alpha because DEC was not doing well and they couldn't invest the required amount.

> no reason to expose it to the user in the ISA

Except that hiding the implementation is costly.

If you give 2 equal teams the same amount of money, what results in a faster chip? A team that does a simple RISC instruction set, or a team that does a complex CISC instruction set and transforms it into an underlying simpler instruction set?

Now of course for Intel, they had backward compatibility, so they had to do what they had to do. They were just lucky they were able to invest so much more than all the other competitors.


> You had literal students design chips that outperformed industry cores that took huge teams and huge investment

Everyone remember to thank our trans heroine Sophie Wilson (CBE).


> If you give 2 equal teams the same amount of money, what results in a faster chip.

Depends on the amount of money. If it's less than a certain amount, the RISC design will be faster. If it's above, both designs will perform about the same.

I mean, look at ARM: they too have to decode their instructions into micro-ops and cache those in their high-performance models. What RISC buys you is the ability to be competitive at the low end of the market, with simplistic implementations. That's why we won't ever see e.g. a stack-like machine — no exposed general-purpose registers, but with flexible addressing modes for the stack, even something like [SP+[SP+12]]; the stack is mirrored onto a hidden register file which is used as an "L0" cache, which neatly solves the problem that register windows were supposed to solve — such a design can be made as fast as server-grade x86 or ARM, but only by throwing billions of dollars and several man-millennia at it; and if you try to do it cheaper and quicker, its performance would absolutely suck. That's why e.g. System/360 didn't make that design choice although IBM seriously considered it for half a year — they then found out that the low-end machines would be unacceptably slow, so they went with the "registers with base-plus-offset addressed memory" design.


All fine except Itanium happened and it goes against everything you list out...?


Itanium was not in any sensible way RISC, it was "VLIW". That pushed a lot of needless complexity into compilers and didn't deliver the savings.


To add on to what the sibling said, ignoring that CISC chips have a separate frontend to break complex instructions down into an internal RISC-like instruction set and thus the difference is blurred, more RISC-like instruction sets do tend to win on performance and power, for the main reason that the instruction set has a fixed width. This means that with a line of cache and 4-byte instructions you could start decoding 32 instructions in parallel, whereas x86's variable-length encoding makes it harder to keep the superscalar pipeline full (its decoder is significantly more complex in order to still extract parallelism, which further slows it down). This is a bit more complex on ARM (and maybe RISC-V?) where you have two widths, but even then in practice it's easier to extract performance out of it, because x86 instructions can be anywhere from 1-4 bytes (or 1-8? Can't remember), which makes it hard to find instruction boundaries in parallel.

There’s a reason that Apple is whooping AMD and Intel on performance/watt and it’s not solely because they’re on a newer fab process (it’s also why AMD and Intel utterly failed to get mobile CPU variants of their chips off the ground).


x86 instruction lengths range from 1 to 15.

> a line of cache and 4 byte instructions you could start decoding 32 instructions in parallel

In practice, ARM processors decode up to 4 instructions in parallel; so do Intel and AMD.


Apple's M1 chips are 8-wide, and AMD's and Intel's newest chips are also doing fancier things than 4-wide.


Any reading resources? I'd love to learn more about the techniques they're using to get better parallelism. The most obvious solution I can imagine is that they'd just brute-force starting to execute at every possible boundary and rely on either decoding an invalid instruction or late-latching the result until it's confirmed that it was a valid instruction boundary. Is that generally the technique, or are they doing more than even that? The challenge with this technique of course is that you risk wasting energy & execution units on phantom stuff vs an architecture that didn't have as much phantomness potential in the first place.


https://chipsandcheese.com/2024/08/14/amds-ryzen-9950x-zen-5... is a pretty good overview of the microarchitecture. I don't think they say how they get there, because trade secrets.


But couldn't we define the RISC dream as the dream that efficiency, performance and low cost could be achieved by cores with very small instruction sets?


Not small instruction sets, simplified instruction sets. RISC’s main trick is to reduce the number of addressing modes (eg, no memory indirect instructions) and reduce the number of memory operands per instruction to 0 or 1. Use the instruction encoding space for more registers instead.

The surviving CISCs, x86 and IBM's z/Architecture, are the least CISCy CISCs. The surviving RISCs, ARM and POWER, are the least RISCy RISCs.

RISC V is a weird throwback in some aspects of its instruction set design.


More details on how RISCy or CISCy various chips are:

https://userpages.umbc.edu/~vijay/mashey.on.risc.html

Notably x86 is one of the less CISCy CISCs so it looks like there might be a happy medium.


Let's be real, it's about business models. POWER was and is backed by IBM. ARM won on mobile. Does this mean POWER and ARM are better than MIPS, SPARC, PA-RISC, Am29000, i860? I don't think so.


If adding more instructions negatively impacts efficiency, performance, cost and complexity, nobody would do it.


Probably true now, but in ye olde days, some instructions existed primarily to make assembly programming more convenient.

Assembly programming is a real pain in the RISCiest of RISC architectures, like SPARC. Here's an example from https://www.cs.clemson.edu/course/cpsc827/material/Code%20Ge...:

• All branches (including the one caused by CALL, below) take place after execution of the following instruction.

• The position immediately after a branch is the “delay slot” and the instruction found there is the “delay instruction”.

• If possible, place a useful instruction in the delay slot (one which can safely be done whether or not a conditional branch is taken).

• If not, place a NOP in the delay slot.

• Never place any other branch instruction in a delay slot.

• Do not use SET in a delay slot (only half of it is really there).


Delay slots were such a hack. ARM never needed them.


Only if decoder complexity/efficiency is your bottleneck.


In this particular context, they're trying to run code compiled for x86_64 on RISC-V. The need for "we just need a few more instructions to get feature parity" comes from trying to run code that is already compiled for an architecture with all those extra instructions.

In theory, if you compiled the original _source_ code for RISC-V, you'd get an entirely different binary and wouldn't need those specific instructions.

In practice, I doubt anyone is going to actually compile these games for RISC-V.


The explanation that I've seen is that it's "(reduced instruction) set computer" - simple instructions, not necessarily few.


Beyond the most trivial of microcontrollers and experimental designs there are no RISC chips under the original understanding of RISC. The justification for RISC evaporated when we became able to put 1 million, 100 million, and so on, transistors on a chip. Now all the chips called "RISC" include vector, media, encryption, network, FPU, etc. instructions. Someone might want to argue that some elements of RISC designs (orthogonal instruction encoding, numerous registers, etc.) make a particular chip a RISC chip. But they really aren't instances of the literal concept of RISC.

To me, the whole RISC-V interest is all just marketing. As an end user I don't make my own chips and I can't think of any particular reason I should care whether a machine has RISC-V, ARM, x86, SPARC, or POWER. In the end my cost will be based on market scale and performance. The licensing cost of the design will not be passed on to me as a customer.


That screenshot shows 31 GB of RAM, which is distinctly more than the mentioned dev board at max specs. Are they using something else here?


Pioneer, an older board.

Note that, today, one of the recent options with several, faster cores implementing RVA22 and RVV 1.0 is the better idea.



The milk-v pioneer comes with 128GB of RAM.


Is this the 86Box? I found it fun reliving the time I got my Amstrad PC1512: I added two hard cards of 500MB and a 128k memory expansion to 640KB, which made things a lot more fun. Back then I only had two 360KB floppies and added a 32MB hard card a few years later. I had Borland Turbo Pascal and Zortech C too. Fun times.


No, it's Box64, a completely different project.

(But I do remember the time I had an Amstrad PC1512 too :D )


It will be interesting to try out Box64 as soon as I get my hands on some suitable RISC-V hardware. I have played with RISC-V microcontrollers and they're quite nice to work with.


I wonder if systems will ship at some point that are a handful of big RISC-V CPUs, and then a “GPU” implemented as a bunch of little RISC-V CPUs (with the appropriate vector stuff—actually, side-question, can classic vectors, instead of packed SIMD, be useful in a GPU?)


Another technically impressive Witcher 3 feat was the Switch port, it ran really well. Goes to show how much can be done with optimization and how much resources are wasted on the PC purely by bad optimization.


And with using much lower quality textures and 3D models, therefore using much less RAM for assets. It's not an apples to apples comparison and you can't really make claims about bad optimization on PCs when the scope of what's shown on screen is vastly different.


You too can run Witcher 3 equally on a minimal PC if you're willing to set the render resolution to 720p (540p undocked), settings to below minimum, and call ~30 FPS well.


I hope they're able to get this ISA-level feedback to people at RVI


The scalar efficiency SIG has already been discussing bitfield insert and extract instructions.

We figured out yesterday [1] that the example in the article can already be done in four RISC-V instructions; it's just a bit trickier to come up with:

    # a0 = rax, a1 = rbx; e.g. x86 "add ah, bl": add the low byte of rbx into bits 8-15 of rax
    slli t0, a1, 64-8     # move bl up into the top byte (bits 56-63) of t0
    rori a0, a0, 16       # rotate rax so that ah sits in the top byte
    add a0, a0, t0        # add the bytes; any carry falls off the top, lower bits untouched
    rori a0, a0, 64-16    # rotate back, leaving the updated ah in bits 8-15
[1] https://www.reddit.com/r/RISCV/comments/1f1mnxf/box64_and_ri...


Nice trick, in fact with 4 instructions it's as efficient as extract/insert and it works for all ADD/SUB/OR/XOR/CMP instructions (not for AND), except if the source is a high-byte register. However it's not really a problem if code generation is not great in this case: compilers in practice will not generate accesses to these registers, and while old 16-bit assembly code has lots of such accesses it's designed to run on processors that ran at 4-20 MHz.

Flag computation and conditional jumps is where the big optimization opportunities lie. Box64 uses a multi-pass decoder that computes liveness information for flags and then computes flags one by one. QEMU instead tries to store the original operands and computes flags lazily. Both approaches have advantages and disadvantages...
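
Roughly, the lazy scheme records the inputs of the last flag-setting operation and only materializes a flag when something consumes it; a toy sketch of the idea (not how either emulator is actually structured):

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { OP_ADD64, OP_SUB64 } flag_op;
    static struct { flag_op op; uint64_t a, b, res; } last_flag_op;

    static uint64_t emu_add64(uint64_t a, uint64_t b) {
        last_flag_op.op = OP_ADD64;
        last_flag_op.a = a;
        last_flag_op.b = b;
        last_flag_op.res = a + b;
        return last_flag_op.res;          // flags are NOT computed here
    }

    // Only evaluated if a Jcc/SETcc/etc. actually reads the flag.
    static bool zero_flag(void)  { return last_flag_op.res == 0; }

    static bool carry_flag(void) {
        switch (last_flag_op.op) {
        case OP_ADD64: return last_flag_op.res < last_flag_op.a;  // unsigned wraparound
        case OP_SUB64: return last_flag_op.a < last_flag_op.b;    // borrow
        }
        return false;
    }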


> except if the source is a high-byte register

That's just one more instruction, to right-align the AH, BH etc src operand prior to exactly the same instructions as above.

And, yes, this being 64 bit code compilers won't be generating such instructions. In fact they started avoiding them as soon as OoO hit in the Pentium Pro, P II, P III etc in the mid 90s because of "partial register update stalls".


Actually, Box64 can also store operands for later computation, depending on what comes next...


Author here, we have adopted this approach as a fast path to box64: https://github.com/ptitSeb/box64/pull/1763, thank you very much!


None of this is new. None of it.

In fact, bitfield extract is such an obvious oversight that it is my favourite example of how idiotic the RISCV ISA is (#2 is lack of sane addressing modes).

Some of the better RISCV designs, in fact, implement a custom instr to do this, eg: BEXTM in Hazard3: https://github.com/Wren6991/Hazard3/blob/stable/doc/hazard3....


Whoa, someone else who doesn't believe that the RISC-V ISA is 'perfect'! I'm curious: how have the discussions on bitfield extract been going? Because it does really seem like an obvious oversight and something to add as a 'standard extension'.

What's your take on

1) unaligned 32bit instructions with the C extension?

2) lack of 'trap on overflow' for arithmetic instructions? MIPS had it..


IMHO they made a mistake by not allowing immediate data to follow instructions. You could encode 8 bit constants within the opcode, but anything larger should be properly supported with immediate data. As for the C extension, I think that was also inferior because it was added afterward. I'd like to see a re-encoding of the entire ISA in about 10 years once things are really stable.


The main problem with what you’re saying is that none of the lessons learned are new. They were all well-known before this ISA was designed, so if the designers had any intention of learning from the past, they had every opportunity to do so.


The handling of misaligned loads/stores in RISC-V can also be considered a disappointment: https://github.com/riscv/riscv-isa-manual/issues/1611 It prefers the convenience of hardware developers and "flexibility" over making the practical guarantees needed by software developers. It looks like the MIPS patent on misaligned load/store instructions played its negative role here. The patent expired in 2019, but it seems we are stuck with the current status quo nevertheless.


1. aarch64 does this right. RISC-V tries to be too many things at once, and predictably ends up sucking at everything. Fast big cores should just stick to fixed-size instrs for faster decode: you always know where instrs start, and every cacheline has an integer number of instrs. Microcontroller cores can use compressed instrs, since code size matters there, while decoding instrs in parallel does not matter there. Trying to have one arch cover it all is idiotic.

2. nobody uses it on MIPS either, so it is likely of no use.


> Fast big cores should just stick to fixed size instrs for faster decode.

How much faster, though? RISC-V decode is not crazy like x86, you only need to look at the first byte to know how long the instruction is (the first two bits if you limit yourself to 16 and 32-bit instructions, 5 bits if you support 48-bits instructions, 6 bits if you support 64-bits instructions). Which means, the serial part of the decoder is very very small.
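
For reference, a sketch of that length decode following the standard RISC-V length-encoding scheme (the 48/64-bit formats are still reserved and unused in practice):

    #include <stdint.h>

    // Instruction length in bytes, from the low bits of the first 16-bit parcel.
    static int rv_insn_length(uint16_t low) {
        if ((low & 0x03) != 0x03) return 2;  // compressed (C) instruction
        if ((low & 0x1c) != 0x1c) return 4;  // standard 32-bit instruction
        if ((low & 0x3f) == 0x1f) return 6;  // 48-bit encoding (reserved)
        if ((low & 0x7f) == 0x3f) return 8;  // 64-bit encoding (reserved)
        return 0;                            // longer/reserved formats
    }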

The bigger complaint about variable length instructions is potentially misaligned instructions, which do not play well with cache lines (a single instruction may start in one cache line and end in the next, making the hardware a bit more hairy).

And there’s an advantage to compressed instructions even on big cores: less pressure on the instruction cache, and correspondingly fewer cache misses.

Thus, it’s not clear to me that fixed size instructions is the obvious way to go for big cores.


Another argument against the C extension is that it uses a big chunk of the opcode space, which may be better used for other extensions with 32-bit instructions.


Are just 32-bit and naturally aligned 64-bit instructions a better path than 16/32/48/64-bit instructions?

I think it's quite unclear which one is better. 48-bit instructions have a lot of potential IMO: they have better code density than naturally aligned 64-bit instructions, and they can encode more than 32-bit ones (2/3 to 3/4 of 43 bits of encoding).

There are essentially two design philosophies:

1. 32-bit instructions, and 64 bit naturally aligned instructions

2. 16/32/48/64 bit instructions with 16 bit alignment

Implementation complexity is debatable, although it seems to somewhat favor option 1:

1: you need to crack instructions into uops, because your 32-bit instructions need to do more complex things

2: you need to find instruction starts, and handle decoding instructions that span across a cache line

How big the impact is relative to the entire design is quite unclear.

Finding instruction starts means you need to propagate a few bits over your entire decode width, but cracking also requires something similar. Consider that if you can handle 8 uops, then those can come from the first 4 instructions that are cracked into 2 uops each, or from 8 instructions that don't need to be cracked, and everything in between. With cracking, you have more freedom about when you want to do it in the pipeline, but you still have to be able to handle it.

In the end, both need to decode across cachelines for performance, but one needs to deal with an instruction split across those cache lines. To me this sounds like it might impact verification complexity more than the actual implementation, but I'm not qualified enough to know.

If both options are suited for high performance implementations, then it's a question about tradeoffs and ISA evolution.


There is also a middle ground of requiring to pad 16/48-bit sequences with 16-bit NOP to align them to 32 bits. I agree that at this time it's not clear whether the C extension is a good idea or not (same with the V extension).


The C extension authors did consider requiring alignment/padding to prevent the misaligned 32-bit instruction issues, but they specifically mention rejecting it since it ate up all the code size savings.


Did they specifically analyze doing alignment on a cache line basis?


This would require specifying a cache line size in the ABI, which is a somewhat odd uarch detail to bubble up. While 64-bytes is conventional for large application processors and has been for a long time, I wouldn't want to make it a requirement.


It's definitely worth analyzing though.

See how big of a block you need to get 90% of the compression benefit, etc.


that seems really tough for compilers.


Not really. Most modern x86 compilers already align jump targets to cache line boundaries since this helps x86 a lot. So it is doable. If you compile each function into a section (common), then the linker can be told to align them to 64 or 128 bytes easily. Code size would grow (but tetris can be played to reduce this by packing functions)


Frankly, there is no advantage to compressed instructions in a high performance CPU core as a misaligned instruction can span a memory page boundary, which will generate a memory fault, potentially a TLB flush, and, if the memory page is not resident in memory, will require an I/O operation. Which is much worse than crossing a cache line. It is a double whammy when both occur simultaneously.

One suggested solution has been filling in gaps with NOP's, but then the compiler would have to track the page alignment, which would not work anyway if a system supports pages of varying sizes (ordinary vs huge pages).

The best solution is perhaps to ignore compressed instructions when targeting high performance cores and confine their usage to where they belong: power efficient or low performance microcontrollers.


Page crossing affects a minuscule amount of cases - with 4096B pages and 100% non-compressed instructions (but still somehow 50% of the time misaligned), it affects only one in 2048 instructions.

The possibility of I/O is in no way exclusive to compressed instructions. If the page-crossing instruction were padded, the second page would need to be faulted in anyway. All that matters is the number of pages needed for the piece of code, which is simply just code size.

The only case that actually has a chance of mattering is crossing cachelines.

And I would imagine high-performance cores would have some internal instruction buffer anyway, for doing cross-fetch-block instruction fusion and whatnot.


> One suggested solution has been filling in gaps with NOP's, but then the compiler would have to track the page alignment, which would not work anyway if a system supports pages of varying sizes (ordinary vs huge pages).

If it's in the linker then tracking pages sounds pretty doable.

You don't need to care about multiple page sizes. If you pad at the minimum page size, or even at 1KB boundaries, that's a miniscule number of NOPs.


Fixed size instructions are not absolutely necessary, but keeping them naturally aligned is just better even if that means using C instructions a bit less often. It's especially messy that 32-bit instructions can span a page.


>2. nobody uses it on mips either, so it is likely of no use.

Sure, but at the time Rust and Zig didn't exist; these two languages have a mode which detects integer overflow.


Bitfield-extract is being discussed for a future extension. E.g. Qualcomm is pressing for it to be added.

In the meantime, it can be done as two shifts: left to the MSB, and then right filling with zero or sign bits. There is at least one core in development (SpaceMiT X100) that is supposed to be able to fuse those two into a single µop, maybe some that already do.
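
In C terms the two-shift form is just the following (a sketch; lsb and width describe the field, and the signed variant assumes the usual arithmetic behavior for right-shifting a negative value):

    #include <stdint.h>

    // Unsigned extract of the width-bit field at bit lsb: shift the field's top
    // bit up to bit 63, then shift right filling with zeros (slli + srli on RV64).
    static uint64_t bextr_u(uint64_t x, unsigned lsb, unsigned width) {
        return (x << (64 - lsb - width)) >> (64 - width);
    }

    // Signed variant: the right shift is arithmetic (slli + srai), filling with sign bits.
    static int64_t bextr_s(uint64_t x, unsigned lsb, unsigned width) {
        return (int64_t)(x << (64 - lsb - width)) >> (64 - width);
    }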

However, I've also seen that one core (XianShan Nanhu) is fusing pairs of RVI instructions into one in the B extension, to be able to run old binaries compiled for CPUs without B faster. Throwing hardware at the problem to avoid a recompile ... feels a bit backwards to me.


I'm not very familiar with the ecosystem, but I have used this on an RPi4 to run some games through wine.

I'm wondering, how's the landscape nowadays. Is this the leading project for x86 compatibility on ARM? With the rising popularity of the architecture for consumer platforms, I'd guess companies like Valve would be interested in investing in these sort of translation layers.


Previously: https://news.ycombinator.com/item?id=19118642

And:

Milk-V Pioneer A 64-core, RISC-V motherboard and workstation for native development

https://www.crowdsupply.com/milk-v/milk-v-pioneer


lol, I am going the other way around.

Since the RISC-V ISA is worldwide royalty free and more than nice, I am writing basic rv64 assembly which I interpret on x86_64 hardware with a Linux kernel.

I did not push the envelope as far as a "compiler"; it is really just something to do while waiting for hardcore performant desktop-class, aka large, rv64 hardware implementations.


I used to use GL4ES on the PocketCHIP. And I daily use it on a netbook to get more performance on some GL 2.1 games.


Box86 is so good, I run x86-64 Steam games (servers) on a free Oracle instance (ARM64) with it.


Great game choice!


I remember learning RISC-V in Berkeley CS61C. Anyone from Berkeley?


There's nobody from Berkeley on HN


oh really, didn't know that. Me neither. That course was open-sourced.


wow very impressive


box64 is getting too advanced lol


> The x86 instruction set is very very big. According to rough statistics, the ARM64 backend implements more than 1,600 x86 instructions in total, while the RV64 backend implements about 1,000 instructions

This is just insane and gets us full-circle to why we want RISC-V.


I think the 1600 number is a coarse metric for this sort of thing. Keep in mind that these instructions are limited in the number of formal parameters they can take: e.g. 16 nominally distinct instructions can be more readily understood/memorized as one instruction with an implicit 4-bit flag. Obviously there's a ton of legacy cruft in Intel ISAs, along with questionable decisions, and I'm not trying to take away from the appeals of RISC (e.g. there are lots of outstanding compiler bugs around these "pseudoparameterized" instructions). But it's easy to look at "1600" and think "ridiculous bloat," when in reality it's somewhat coherent and systematic - and more to the point, clearly necessary for highly performance-sensitive work.


> clearly necessary for highly performance-sensitive work

It's clearly necessary to have compatibility back to the 80s. It's clearly necessary to have 10 different generations of SIMD. It's clearly necessary to have multiple different floating point systems.


If an insane instruction set gives us higher performance and makes CPU and compiler design more complex, this might be an acceptable trade-off.


But it doesn't.

It's simply about the amount of investment. x86 had 50 years of gigantic amounts of sustained investment. Intel outsold all the RISC vendors combined by like 100 to 1 because they owned the PC business.

When Apple started seriously investing in ARM, they were able to match or beat x86 laptops.

The same will be true for RISC-V.


ARM64 has approximately 1300 instructions.


I want somebody to make a GPT fine tune that specializes in converting instructions and writing tests. If you made it read all x86 docs a bunch and risc v docs, a lot of this could be automated.


Not really. RISC-V's benefit is not the "Reduced Instruction Set" part, it's the open ISA part. A small instruction set actually has several disadvantages. It means your binary gets bigger because what was a single operation in x86 is now several in RISC-V, meaning more memory bandwidth and cache is taken up by instructions instead of data.

Modern CPUs are actually really good at decoding operations into micro-ops. And the flexibility of being able to implement a complex operation in microcode, or silicon is essential for CPU designers.

Is there a bunch of legacy crap in x86? Yeah. Does getting rid of it dramatically increase the performance ceiling? Probably not.

The real benefit of RISC-V is anybody can use it. It's democratizing the ISA. No one has to pay a license to use it, they can just build their CPU design and go.


> Modern CPUs are actually really good at decoding operations into micro-ops.

The largest out-of-order CPUs are actually quite reliant on having high-performance decode that can be performed in parallel using multiple hardware units. Starting from a simplified instruction set with less legacy baggage can be an advantage in this context. RISC-V is also pretty unique among 64-bit RISC ISA's wrt. including compressed instructions support, which gives it code density comparable to x86 at a vastly improved simplicity of decode (For example, it only needs to read a few bits to determine which insns are 16-bit vs. 32-bit length).


> means your binary gets bigger .... meaning more memory bandwidth and cache

Except this isn't actually true.

> Does getting rid of it dramatically increase the performance ceiling? Probably not.

No but it dramatically DECREASES the amount of investment necessary to reach that ceiling.

Assume you have 2 teams, and each gets the same amount of money. Then ask them to make the highest-performing spec-compatible chip. Which team is going to win 99% of the time?

> And the flexibility of being able to implement a complex operation in microcode, or silicon is essential for CPU designers.

You can add microcode to a RISC-V chip if you want, most people just don't want to.

> The real benefit of RISC-V is anybody can use it.

That is true, but it's also just a much better instruction set than x86 -_-


>It means your binary gets bigger

False premise; as the size tool shows, RVA20 (RV64GC) binaries were already the smallest among 64-bit architectures.

Code gets smaller still (rather than larger) with newer extensions such as B in RVA22.

As of recently, the same is true in 32bit when comparing rv32 against former best (thumb2). But it was quite close before to begin with.


>15 fps in-game

Wow...that's substantially more than I would have guessed. Good times ahead for hardware


"which allows games like Stardew Valley to run, but it is not enough for other more serious Linux games"

Hey! ;-)



