For x86 cores this is visible in Agner Fog's instruction performance tables: https://agner.org/optimize/#manuals

The latency shows after how many cycles the result of an instruction can be consumed by another, while the throughput shows how many independent instances of that instruction can be started per cycle, i.e. executed in parallel.
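As a rough sketch of how those two numbers are usually measured (illustrative, not from the tables; it assumes x86-64 with gcc/clang, a TSC ticking near the core clock, and uses 64-bit multiply as the example instruction):

    /* Dependent chain exposes latency; independent chains expose throughput.
       Compile with -O1; numbers are per-multiply estimates. */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    #define N 100000000ULL
    #define K 0x9e3779b97f4a7c15ULL  /* arbitrary odd multiplier */

    int main(void) {
        uint64_t x = 1, a = 1, b = 2, c = 3;

        uint64_t t0 = __rdtsc();
        for (uint64_t i = 0; i < N; i++) x *= K;          /* each mul waits for the last */
        uint64_t t1 = __rdtsc();

        uint64_t t2 = __rdtsc();
        for (uint64_t i = 0; i < N; i++) { a *= K; b *= K; c *= K; }  /* chains overlap */
        uint64_t t3 = __rdtsc();

        printf("dependent:   ~%.2f cycles/mul\n", (double)(t1 - t0) / N);
        printf("independent: ~%.2f cycles/mul\n", (double)(t3 - t2) / (3.0 * N));
        return (int)(x + a + b + c) & 1;  /* keep the results live */
    }

On a core whose 64-bit multiplier has 3-cycle latency and 1/cycle throughput, the first number should come out near 3 and the second near 1, which is exactly the signature of a pipelined unit.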




I believe the throughput shown in those tables is the total throughput for the whole CPU core, so it isn't immediately obvious which instructions have high throughput due to pipelining within an execution unit and which have high throughput due just to the core having several execution units capable of handling that instruction.


That's true, but another part of the tables shows how many "ports" the operation can be executed on, which is enough information to conclude an operation is pipelined: if the inverse throughput times the number of usable ports is smaller than the latency, each unit must be accepting new operations before the previous ones have finished.

For example, for many years Intel chips had a multiplier unit on a single port, with a latency of 3 cycles, but an inverse throughput of 1 cycle, so effectively pipelined across 3 stages.

In any case, I think uops.info [1] has replaced Agner for up-to-date and detailed information on instruction execution.

---

[1] https://uops.info/table.html


Shame it doesn't seem to have been updated with Arrow Lake, Zen 5 and so on yet.


Yes. In the past, new HW has been made available to the uops.info authors so they could run their benchmark suite and publish new numbers. I'm not sure whether that just hasn't happened for the newer parts, or whether they're not interested in updating it.


FWIW, there are two ideas of parallelism being conflated here. One is the parallel execution of the different sequential steps of an instruction (e.g. fetch, decode, operate, retire). That's "pipelining", and it's a different idea than decoding multiple instructions in a cycle and sending them to one of many execution units (which is usually just called "dispatch", though "out of order execution" tends to connote the same idea in practice).

The Fog tables try hard to show the former, not the latter. You measure dispatch parallelism with benchmarks, not microscopes.

Also IIRC there are still some non-pipelined units in Intel chips, like the division engine, which show latency numbers roughly equal to their execution time.


I don't think anyone is talking about "fetch, decode, operate, retire" pipelining (though that is certainly called pipelining): only pipelining within the execution of an instruction that takes multiple cycles just to execute (i.e., latency from input-ready to output-ready).

Pipelining in stages like fetch and decode is mostly hidden in these small benchmarks, but becomes visible when there are branch mispredictions, other types of flushes, I$ misses and so on.


> I don't think anyone is talking about "fetch, decode, operate, retire" pipelining (though that is certainly called pipelining): only pipelining within the execution of an instruction that takes multiple cycles just to execute (i.e., latency from input-ready to output-ready).

I'm curious what you think the distinction is? Those statements are equivalent. The circuit implementing "an instruction" can't do its work in a single cycle, so you break it up and overlap sequentially issued instructions. Exactly what they do will be different for different hardware, sure; clearly we've moved beyond the classic four-stage Patterson pipeline. But that doesn't make it a different kind of pipelining!


We are interested in the software-visible performance effects of pipelining. For small benchmarks that don't miss in the predictors or icache, this mostly means execution pipelining. That's the type of pipelining the article is discussing, and the type reflected in the instruction performance breakdowns from Agner and uops.info, simulated by LLVM-MCA, etc.

I.e., a lot of what you need to model for tight loops only depends on the execution latencies (as little as 1 cycle), and not on the full pipeline end-to-end latency (almost always more than 10 cycles on big OoO, maybe more than 20).
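To make that concrete, here is a hedged C sketch (the ~4-cycle FP-add latency is an assumption, typical of recent Intel cores; compile without -ffast-math so the chains aren't re-associated). The single-accumulator sum is limited by the loop-carried add latency, the four-accumulator version approaches the adder's throughput, and neither ever "sees" the 10-20 cycle front-end depth. Running llvm-mca over the compiled loop bodies shows the same split without executing anything.

    /* Tight-loop performance follows execution latency, not full pipeline depth. */
    #include <stddef.h>

    double sum1(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];                    /* one chain: latency-bound, ~4 cycles/element */
        return s;
    }

    double sum4(const double *a, size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {      /* four chains overlap in the pipelined adder */
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        double s = s0 + s1 + s2 + s3;
        for (; i < n; i++) s += a[i];     /* remainder */
        return s;
    }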


Adding to this: the distinction is that an entire "instruction pipeline" can be [and often is] decomposed into many different pipelined circuits. This article is specifically describing the fact that some execution units are pipelined.

Those are different notions of pipelining with different motivations: one is motivated by "instruction-level parallelism," and the other is motivated by "achieving higher clock rates." If 64-bit multiplication were not pipelined, the minimum achievable clock period would be constrained by "how long it takes for bits to propagate through your multiplier."
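As a back-of-the-envelope illustration with made-up numbers: if the combinational path through a 64-bit multiplier takes roughly 2.4 ns, an unpipelined design either caps the clock around 400 MHz or has to turn every multiply into a multi-cycle stall. Cut the same logic into three stages of about 0.8 ns each and the core can clock around 1.2 GHz, accept a new multiply every cycle, and return each result three cycles later, i.e. the 3/1 latency/throughput shape mentioned upthread.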


> one is motivated by "instruction-level parallelism," and the other is motivated by "achieving higher clock rates."

Which are exactly the same thing? For exactly the same reasons?

Sure, you can focus your investigation on one or the other but that doesn't change what they are or somehow change the motivations for why it is being done.

And you can have a shorter clock period than your non-pipelined multiplier just fine. Just that other uses of that multiplier would stall in the meantime.


Independently scheduled and queued execution phases are qualitatively different from a fixed pipeline.


An OoO design is qualitatively different from an in-order one because of renaming and dynamic scheduling, but the pipelining is essentially the same and for the same reasons.


> and it's a different idea than decoding multiple instructions in a cycle and sending them to one of many execution units (which is usually just called "dispatch", though "out of order execution

Being able to execute multiple instructions per cycle is more properly superscalar execution, right? In-order designs are also capable of doing it, and the separate execution units do not even need to run in lockstep (consider the original P5 U and V pipes).


Right; it's easy to forget that superscalar CPU cores don't actually have to be out-of-order, but most of them are out-of-order because that's usually necessary to make good use of a wide superscalar core.

(What's the best-performing in-order general purpose CPU core? POWER6 was notably in-order and ran at quite high clock speeds for the time. Intel's first-gen Atom cores were in-order and appeared around the same time as POWER6, but at half the clock speed. SPARC T3 ran at an even lower clock speed.)


POWER6 might indeed have been the last In-Order speed demon.


The IBM Z10 came out a year later. It was co-designed with POWER6 as part of IBM's eClipz project, and shared a number of features / design choices, including in-order execution.


In-order parallel designs are "VLIW". The jargon indeed gets thick. :)

But as to OO: the whole idea of issuing sequential instructions in parallel means that the hardware needs to track dependencies between them so they can't race ahead of their inputs. And if you're going to do that anyway, allowing them to retire out of order is a big performance/transistor-count win as it allows the pipeline lengths to be different.


VLIW is again a different thing. It uses a single instruction that encodes multiple independent operations to simplify decoding and tracking, usually with exposed pipelines.

But you can have, for example, a classic in-order RISC design that allows for parallel execution. OoO renaming is not necessary for dependency tracking (in fact even scalar in order CPUs need dependency tracking to solve RAW and other hazards), it is "only" needed for executing around stalled instructions (while an in order design will stall the whole pipeline).

Again, P5 (i.e. the original Pentium) was a very traditional in-order design, yet could execute up to two instructions per cycle.


> VLIW is again a different thing.

No it isn't. I'm being very deliberate here in refusing pedantry. In practice, "multiple dispatch" means "OO" in the same way that "VLIW" means "parallel in-order dispatch". Yes, you can imagine hypothetical CPUs that mix the distinction, but they'd be so weird that they'd never be built. Discussing the jargon without context only confuses things.

> you can have, for example, a classic in-order RISC design that allows for parallel execution.

Only by inventing VLIW, though; otherwise there's no way to tell the CPU how to order what it does. Which is my point; the ideas are joined at the hip. Note that the Pentium had two defined pipes with specific rules about how the pairing was encoded in the instruction stream. It was, in practice, a VLIW architecture (just one with a variable length encoding and where most of the available instruction bundles only filled one slot)! Pedantry hurts in this world, it doesn't help.


> Note that the Pentium had two defined pipes with specific rules about how the pairing was encoded in the instruction stream. It was, in practice, a VLIW architecture (just one with a variable length encoding and where most of the available instruction bundles only filled one slot)!

This is ridiculous. There are no nop-filled slots in the instruction stream, and you can't even be sure which instructions will issue together unless you trace backwards far enough to find a sequence of instructions that can only be executed on port 0 and thus provide a known synchronization point. The P5 only has one small thing in common with VLIW, and there's already a well-accepted name for that feature, and it isn't VLIW.


Meh. P5's decode algorithm looks very much like VLIW to me, and emphatically not like the 21264/P6 style of dispatch that came to dominate later. I find that notable, and in particular I find senseless adherence to jargon definitions[1] hurts and doesn't help in this sphere. Arguing about how to label technology instead of explaining what it does is a bad smell.

[1] That never really worked anyway. ia64, Transmeta's devices and Xtensa HiFi are all "VLIW" by your definition yet work nothing like each other.


I'm sorry, but if P5 was VLIW then the word has lost all meaning. They couldn't possibly be more different.


The point was exactly that "VLIW" as a term has basically no meaning. What it "means" in practice is parallel in-order dispatch (and nothing about the instruction format), which is what I said upthread.


VLIW means Very Long Instruction Word. It is a property of the instruction set, not of the processor that implements it.

You could have a VLIW ISA that is implemented by a processor that "unrolls" each instruction word and mostly executes the constituent instructions serially.


Also you can have out-of-order VLIW. Later Itaniums were like that, because it turns out VLIW doesn't help much with random memory access latency.


> Also IIRC there are still some non-pipelined units in Intel chips, like the division engine, which show latency numbers roughly equal to their execution time

I don't think that's accurate. That latency exists because the execution unit is pipelined. If it were not pipelined, there would be no latency. The latency corresponds to the fact that "doing division" is distributed across multiple clock cycles.


Division is complicated by the fact that it is a complex micro-coded operation with many component micro-operations. Many or all of those micro-operations may in fact be pipelined (e.g., 3/1 lat/itput), but the overall effect of executing a large number of them looks not very pipelined at all (e.g., 20 of them on a single EU would have 22/20 lat/itput, basically not pipelined when examined at that level).


Sorry, correcting myself here: it's cut across multiple cycles but not pipelined. Maybe I confused this with multiplication?

If it were pipelined, you'd expect to be able to schedule DIV every cycle, but I don't think that's the case. Plus, 99% of the time the pipeline would just be doing nothing because normal programs aren't doing 18 DIV instructions in a row :^)



