
> Right now, most devices on the market do not support the C extension

This is not true and easily verifiable.

The C extension is de facto required; the only cores that don't support it are special-purpose soft cores.

C extension in the smallest available IP core https://github.com/olofk/serv?tab=readme-ov-file

Supports M and C extensions https://github.com/YosysHQ/picorv32

Another size-optimized core with C extension support https://github.com/lowrisc/ibex

C extension in the 10 cent microcontroller https://www.wch-ic.com/products/CH32V003.html

This one should get your goat: it implements as much as it can using only compressed instructions https://github.com/gsmecher/minimax



The expansion of a 16-bit C insn to 32 bits isn't the problem; that part is trivial. The problem (and it is significant) is for a highly speculative superscalar machine that fetches 16+ instructions at a time but cannot tell the boundary of instructions until they are all decoded. Sure, it can be done, but that doesn't mean it doesn't cost you in mispredict penalties (i.e. lost IPC) and in design/verification effort that could have gone into performance.
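To make the dependency concrete, here is a rough C sketch (purely illustrative; a real frontend uses parallel speculative length decoders, but the data dependency is the same: where instruction N starts depends on the lengths of instructions 0..N-1):

    #include <stdint.h>
    #include <stddef.h>

    /* Software model of finding instruction boundaries in a fetch
       bundle when 16-bit (C) and 32-bit encodings are mixed. */
    static size_t insn_len(uint16_t low_parcel)
    {
        /* low two bits != 0b11 -> 16-bit compressed, else 32-bit
           (longer 48/64-bit formats ignored here) */
        return ((low_parcel & 0x3) != 0x3) ? 2 : 4;
    }

    size_t find_boundaries(const uint8_t *bundle, size_t bundle_bytes,
                           size_t *starts, size_t max_starts)
    {
        size_t off = 0, n = 0;
        while (off + 2 <= bundle_bytes && n < max_starts) {
            uint16_t low = (uint16_t)(bundle[off] | (bundle[off + 1] << 8));
            starts[n++] = off;
            off += insn_len(low);   /* the serial dependency lives here */
        }
        return n;   /* the last instruction may straddle into the next bundle */
    }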

It is also true that burning up the encoding space for C means pain elsewhere. Example: branch and jump offsets are painfully small, so small that all non-toy code needs a two-instruction sequence for calls (and sometimes more).
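For concreteness, the reach limits involved (numbers are from the base ISA as I understand it; the helper itself is just an illustration, not from any real toolchain):

    #include <stdint.h>
    #include <stdbool.h>

    #define JAL_REACH    (1 << 20)   /* jal: 21-bit signed offset, +/-1 MiB  */
    #define BRANCH_REACH (1 << 12)   /* beq/bne/...: 13-bit signed, +/-4 KiB */

    /* Does a direct call fit in a single jal, or does it need the
       two-instruction auipc+jalr sequence (reach roughly +/-2 GiB)? */
    static inline bool call_fits_in_jal(int64_t offset)
    {
        return offset >= -JAL_REACH && offset < JAL_REACH;
    }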

These problems don't show up on embedded processors and workloads. They matter for high performance.


> fetches 16+ instructions at a time but cannot tell the boundary of instructions until they are all decoded

Not fully decoded though, since it's enough to look at the lower bits to determine instruction size.
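Concretely (a sketch of the length-encoding rule as I read the spec: only the low bits of the first 16-bit parcel are examined, no opcode decode needed):

    #include <stdint.h>

    unsigned rv_insn_length(uint16_t parcel)
    {
        if ((parcel & 0x03) != 0x03) return 2;  /* compressed (C)   */
        if ((parcel & 0x1c) != 0x1c) return 4;  /* standard 32-bit  */
        if ((parcel & 0x3f) == 0x1f) return 6;  /* 48-bit format    */
        if ((parcel & 0x7f) == 0x3f) return 8;  /* 64-bit format    */
        return 0;                               /* reserved/longer  */
    }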

> Sure, it can be done, but that doesn't mean that it doesn't cost you in mispredict penalties

What does decoding have to do with mispredict penalties?

> Example: branch and jump offsets are painfully small

Yes, that's what the 48-bit instruction encodings are for. See e.g. what the scalar efficiency SIG is currently working on: https://docs.google.com/spreadsheets/u/0/d/1dQYU7QQ-SnIoXp9v...


> Not fully decoded though, since it's enough to look at the lower bits to determine instruction size.

It is not about decoding, which happens later; it is about 32-bit instructions crossing a cache-line boundary in the L1-i cache, which happens first.

Instructions are fetched from the L1-i cache in bundles (i.e. cache lines), and the bundle size is fixed for a specific CPU model. In all RISC CPUs, the cache line size is a multiple of the instruction size (usually 32 bits). The RISC-V C extension breaks this alignment, which incurs a performance penalty for high-performance CPU implementations but is less significant for smaller, low-power implementations where performance is not a concern.

If a 32-bit instruction crosses the cache line boundary, another cache line must be fetched from the L1-i cache before the instruction can be decoded. The performance penalty in such a scenario is prohibitive for a very fast CPU core.

P.S. Even worse if the instruction crosses a page boundary, and the page is not resident in memory.
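To make the two cases concrete (assuming a 64-byte L1i line and 4 KiB pages, which are typical sizes but implementation/OS choices, not mandated by the ISA):

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_BYTES 64u
    #define PAGE_BYTES 4096u

    /* Does the instruction at pc (len bytes) straddle a granule boundary? */
    static inline bool crosses(uint64_t pc, unsigned len, uint64_t granule)
    {
        return (pc / granule) != ((pc + len - 1) / granule);
    }

    /* crosses(pc, 4, LINE_BYTES): both lines are needed before decode.
       crosses(pc, 4, PAGE_BYTES): two iTLB lookups, possibly two faults. */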


I don't think crossing cache lines is particularly a concern? You'll necessarily be fetching the next cache line in the next cycle anyway to decode further instructions (not even an unconditional branch could stop this, I'd think), at which point you can just "prepend" the chopped tail of the preceding bundle (and you'd want some inter-bundle communication for fusion regardless).

This does of course delay decoding that one instruction by a cycle, but you already have that for instructions which are fully in the next line anyway (and aligning branch targets at compile time helps both cases, even if just to a fixed 4 or 8 bytes).
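Roughly what I mean by prepending the tail, as a software model (names are made up, this is obviously not RTL):

    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 64

    struct fetch_ctx {
        uint8_t  tail[2];      /* leftover low parcel of a straddling insn */
        unsigned tail_bytes;   /* 0 or 2 */
    };

    /* Called once per arriving fetch line; decode_window ends up holding a
       contiguous byte stream that starts on an instruction boundary. After
       the length scan, any straddling tail goes back into ctx for the next
       cycle. */
    static unsigned splice_line(struct fetch_ctx *ctx, const uint8_t *line,
                                uint8_t *decode_window /* >= LINE_BYTES + 2 */)
    {
        unsigned n = ctx->tail_bytes;
        memcpy(decode_window, ctx->tail, n);
        memcpy(decode_window + n, line, LINE_BYTES);
        return n + LINE_BYTES;   /* bytes available to the length decoder */
    }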


> I don't think crossing cache lines is particularly much of a concern?

It is a concern if a branch prediction has failed and the current cache line has to be discarded or has been invalidated. If the instruction crosses the cache line boundary, both lines have to be discarded. For a high-performance CPU core, that is a significant and, most importantly, unnecessary performance penalty. It is not a concern for a microcontroller or a low-power design, though.


What does an instruction crossing cache lines have to do with invalidation/discarding? RISC-V doesn't require instruction cache coherency, so the core has few restrictions on its behavior if the line was modified; everything comes down to explicit synchronization instructions. And if you have multiple instructions in the pipeline, you'll likely already have instructions from multiple cache lines anyway. I don't understand what "current cache line" even means in the context of a misprediction, where the entire nature of the problem is that you did not have any idea where to run code from, and thus shouldn't know of any related cache lines.


Mispredict penalty == pipeline latency. Needing to delay decoding/expansion until after figuring out where instructions actually start will necessarily add a delay of some number of gates (whether or not this ends up increasing the mispredict penalty by any cycles of course depends on many things).

That said, the alternative of instruction fission (i.e. what RISC-V avoids requiring) would add some delay too. I have no clue how the two compare though, as I'm not a hardware engineer; and RISC-V does benefit from instruction fusion, which can similarly add latency and which other architectures could decide to try to avoid requiring (though that would get harder to keep up as hardware potential improves while old compiled binary blobs stay unchanged). So it's complicated.
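In case it helps, here is a rough sketch of what a fusing decoder looks for, using one commonly cited pair (lui+addi forming a 32-bit load-immediate); the pair choice and the code are just illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    static inline uint32_t rd_of (uint32_t insn) { return (insn >>  7) & 0x1f; }
    static inline uint32_t rs1_of(uint32_t insn) { return (insn >> 15) & 0x1f; }

    /* lui rd, imm20 followed by addi rd, rd, imm12 can be fused into a
       single "load 32-bit immediate" micro-op; detecting it needs both
       raw encodings side by side, which is the cross-instruction work
       that adds decode latency. */
    static bool is_lui_addi_pair(uint32_t first, uint32_t second)
    {
        bool first_is_lui   = (first  & 0x7f) == 0x37;            /* LUI    */
        bool second_is_addi = (second & 0x7f) == 0x13 &&          /* OP-IMM */
                              ((second >> 12) & 0x7) == 0x0;      /* ADDI   */
        return first_is_lui && second_is_addi &&
               rd_of(second)  == rd_of(first) &&
               rs1_of(second) == rd_of(first);
    }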


Ah, that makes sense, thanks. I think in the end it boils down to both the ARM and the RISC-V approaches being fine, with slightly different tradeoffs.



