Not disagreeing, but I think that was a bit inaccurate.
If that branch is mispredicted, we're talking about 12-20 cycles. OK, I assume it's a range check and thus (nearly) always not taken. So if it's in a hot path, it'll practically always be correctly predicted. Modern CPUs will most likely fuse cmp+jae into one micro-op, so predicted-not-taken + mov will take 2 cycles (+latency).
"cmp/jae/cmp/jne/mov" will of course be fused into 3 micro-ops. But don't you mean "cmp/jae/cmp/je/mov"? I'm assuming the second compare is a NULL check (or at least that the instructions are ordered so that the second branch is practically never taken). I think that also takes 2 cycles (both branches execute on the same clock cycle + mov), but I'm not sure how fused predicted-not-taken branches behave.
An L3 miss on that mov, well... that might well be 200 cycles.
Ah yeah, I wasn't sure if fusion was going to happen. You're probably right in macro-op terms; sorry about that.
The first compare is a bounds check against the array backing the pool, and the second compare is against the type field on the interface, not a null check. Go interfaces are "fat pointers" with two words: a data pointer and a vtable pointer. So the first cmp is against a register, while the second cmp is against memory, at an address that depends on the index in the register. The address of that cmp's memory operand has to at least be checked to determine whether it faults, so I would think at least some part of it has to be serialized after the first branch, making it slower than the version without the type guard.
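For anyone following along, here is a rough Go sketch of the two shapes being compared. The names (node, pool, loadUnchecked, loadNode) are made up for illustration and are not from the code under discussion, and the instruction sequences mentioned in the comments are only what I'd expect on amd64; the actual output depends on the compiler version and inlining.

```go
package main

import "fmt"

// node is a hypothetical concrete type stored in the pool.
type node struct{ value int }

// pool is a hypothetical slice of interface values. Each element is a
// two-word "fat pointer": a type word and a data pointer.
var pool = []interface{}{&node{value: 42}, "not a node"}

// loadUnchecked only indexes the pool. The compiler is expected to emit a
// bounds check (roughly cmp/jae against the slice length) before the load.
func loadUnchecked(i int) interface{} {
	return pool[i]
}

// loadNode adds the type guard. After the bounds check, the type word of
// pool[i] is read from memory and compared against the type of *node, so
// the second compare depends on the index used in the first one.
func loadNode(i int) (*node, bool) {
	n, ok := pool[i].(*node)
	return n, ok
}

func main() {
	fmt.Println(loadUnchecked(1)) // prints the string element

	if n, ok := loadNode(0); ok {
		fmt.Println(n.value) // element 0 passes the type guard
	}

	_, ok := loadNode(1)
	fmt.Println(ok) // element 1 is a string, so the guard fails
}
```

Building with something like `go build -gcflags=-S` should show whether the compiler really emits the cmp/jae/cmp/jne/mov shape for loadNode on a given toolchain.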