Great! I heard somewhere here on HN that a modern x86 decoder is smaller than a ...

zhemao · on Jan 3, 2016

> a modern x86 decoder is smaller than a modern arm decoder

That's because the ARM ISA is not small either, by any stretch of the imagination. On the other hand, the instruction listing of the base RISC-V ISA and the standard extensions can fit on a single PowerPoint slide.

http://riscv.org/workshop-jun2015/riscv-intro-workshop-june2...

I wasn't involved in any of the recent tape-outs, so I can't say exactly how big the decoder is. But it's quite small relative to the other chip components. Currently, the integer pipeline of the chip is roughly the same size as the FPU, and these two together are roughly the same size as the L1 cache. All of those components together are smaller than the L2 cache (depends on the size of the L2 cache, though). So decoder size doesn't really matter in the grand scheme of things.

Decoder speed probably does matter, though. Currently, we can decode an instruction in a single cycle (1 ns). The x86 decoder, on the other hand, can take multiple cycles depending on the instruction. But maybe this isn't a fair comparison, since x86 instructions are decomposed into uops. I have no idea about the performance of ARM decoders.

wolf550e · on Jan 4, 2016

How can you be superscalar with a decoder that only does 1 op/cycle? Intel does 6:

> From the original Core 2 through Haswell/Broadwell, Intel has used a four-wide front-end for fetching instructions. Skylake is the first change to this aspect in roughly a decade, with the ability to now fetch up to six micro-ops per cycle. Intel doesn’t indicate how many execution units are available in Skylake’s back-end, but we know everything from Core 2 through Sandy Bridge had six execution units while Haswell has eight execution ports. We can assume Skylake is now more than eight, and likely the ability to dispatch more micro-ops as well, but Intel didn’t provide any specifics.

http://www.maximumpc.com/idf-2015-san-francisco-skylake-deep...

luismarques · on Jan 4, 2016

I think he meant that the decoding latency is 1 cycle, not that the core can only decode one instruction per cycle.

That is, each baby takes 9 cycles to form, but per 9 cycles the population can have more than one baby.
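To make the latency/throughput distinction concrete, here is a minimal sketch of an idealized pipelined decoder. The function names and the `width` and `latency` parameters are made up for illustration and do not describe any real core; the model assumes no stalls and that `width` instructions can enter decode every cycle.

```python
def decode_schedule(num_instructions, width, latency):
    """Idealized decoder model (hypothetical, no stalls).

    width:   instructions that can enter decode per cycle
    latency: cycles each instruction spends in decode
    Returns (finish_cycle for each instruction, total cycles).
    """
    finish = []
    for i in range(num_instructions):
        start_cycle = i // width          # which cycle this instruction enters decode
        finish.append(start_cycle + latency)
    total = max(finish) if finish else 0
    return finish, total


def throughput(num_instructions, width, latency):
    """Average decoded instructions per cycle over the whole run."""
    _, total = decode_schedule(num_instructions, width, latency)
    return num_instructions / total


# Same 1-cycle latency, different widths:
narrow = throughput(8, width=1, latency=1)   # one instruction per cycle
wide = throughput(8, width=4, latency=1)     # four instructions per cycle
```

In this model both decoders have a latency of one cycle, but the 4-wide one finishes the same eight instructions in a quarter of the cycles. Raising `latency` delays when the first result appears without changing how many instructions enter per cycle.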

Symmetry · on Jan 4, 2016

He was talking about latency; you're talking about throughput.

Symmetry · on Jan 3, 2016

Is that for a decoder that can decode multiple instructions per clock cycle? I think it would be somewhat interesting for a single-instruction decoder, but it would be quite remarkable for a wider decoder, since x86 instructions aren't even self-synchronizing (you can read the same sequence of bytes in different valid ways depending on where you start), while ARM is fixed width.
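A small sketch of what "not self-synchronizing" means. This uses a made-up toy encoding, not real x86 opcodes: the opcode byte alone determines the instruction length, and `TOY_LENGTHS`, `decode_from`, and `fixed_width_boundaries` are hypothetical names for illustration only.

```python
# Hypothetical toy ISA: opcode byte -> (mnemonic, total length in bytes).
# These are NOT real x86 encodings.
TOY_LENGTHS = {
    0x01: ("NOP", 1),
    0x02: ("LOADI", 2),   # opcode + 1 immediate byte
    0x03: ("JMP", 3),     # opcode + 2 immediate bytes
}


def decode_from(code, offset):
    """Decode sequentially starting at `offset`; return [(address, mnemonic)]."""
    out = []
    pc = offset
    while pc < len(code):
        entry = TOY_LENGTHS.get(code[pc])
        if entry is None:
            raise ValueError(f"invalid opcode {code[pc]:#04x} at {pc}")
        name, length = entry
        if pc + length > len(code):
            raise ValueError(f"truncated instruction at {pc}")
        out.append((pc, name))
        pc += length   # the next boundary is only known after decoding this one
    return out


def fixed_width_boundaries(num_bytes, width=4):
    """With a fixed-width encoding, every boundary is known up front."""
    return list(range(0, num_bytes, width))


code = bytes([0x03, 0x02, 0x01, 0x01])
from_start = decode_from(code, 0)    # JMP at 0, NOP at 3
from_middle = decode_from(code, 1)   # LOADI at 1, NOP at 3
```

Both starting points give a valid but different instruction stream from the same bytes. That is why a wide variable-length decoder has to work out instruction boundaries before (or while) decoding in parallel, whereas with a fixed-width encoding the start of every instruction is known in advance.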

FullyFunctional · on Jan 4, 2016

There are two, Berkeley's BOOM and another from Macaque Labs in India (SHAKTI OO core).

Symmetry · on Jan 4, 2016

Those don't seem to be tiny x86 decoders, as far as I can tell.