This surprises a lot of programmers. A barrel shifter is a big circuit by ALU standards. It's smaller than a multiplier, obviously, but much larger than a ripple-carry adder, and somewhat larger even than the fancy pipelined adders used in fast CPUs.
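To give a rough feel for where the area goes: one common textbook way to build a barrel shifter is as log2(N) stages of N-wide 2:1 muxes, where stage k either passes the value through or shifts it by 2^k. The C below is only a software model of that structure, not a description of any particular CPU's shifter; the name `barrel_shl32` is made up for illustration, and real designs vary.

```c
#include <stdint.h>

/* Software model of a 32-bit logarithmic barrel shifter (left shift only).
 * Each loop iteration stands in for one hardware stage: a row of 32
 * two-input muxes choosing between "unshifted" and "shifted by 2^stage".
 * Only the low 5 bits of `amount` are used, so the shift is 0..31. */
uint32_t barrel_shl32(uint32_t x, unsigned amount)
{
    for (unsigned stage = 0; stage < 5; stage++) {
        uint32_t shifted = x << (1u << stage);      /* hard-wired shift by 1, 2, 4, 8, 16 */
        uint32_t select  = ((amount >> stage) & 1u) /* mux select = one bit of the amount */
                               ? 0xFFFFFFFFu : 0u;
        x = (shifted & select) | (x & ~select);     /* the row of 2:1 muxes */
    }
    return x;
}
```

So a 32-bit shifter in this style is roughly 5 x 32 muxes, and that is before adding right shifts, arithmetic shifts, and rotates.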
It's the kind of thing that got skimped on in Atom. Lots of stuff runs surprisingly slowly on Atom. Another case I find striking is that there are two pipes for ADD/SUB instructions, but ADC/SBB (the carry/borrow variants) are not just single-issue, but apparently unpipelined. Intel lists their throughput at 3 instead of 0.5!
> Intel lists their throughput at 3 instead of 0.5!
Agner Fog lists ADC/SBB on Atom as having latency 2 and throughput 1/2 (that is, one instruction every two cycles). I guess Intel could have made them run with ADD/SUB-level performance in cases where there are no intra-pair dependencies on the carry/borrow flag, but the extra interlocks for handling that were probably not worth the cost.
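For anyone wondering where ADC throughput actually matters: the classic case is multi-word addition, where every step consumes the carry produced by the previous one. Here is a minimal sketch in portable C; `add_limbs` is a hypothetical helper name, and whether a given compiler turns this loop into an ADD/ADC chain is an assumption you would have to check in the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Add two n-limb little-endian integers, a + b, into dst.
 * Returns the carry out of the top limb (0 or 1).
 * Each iteration depends on `carry` from the previous one, which is
 * the same dependency an ADC chain has on the carry flag. */
uint32_t add_limbs(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = (uint64_t)a[i] + b[i] + carry; /* cannot overflow 64 bits */
        dst[i] = (uint32_t)sum;                       /* low 32 bits are the result limb */
        carry  = (uint32_t)(sum >> 32);               /* bit 32 is the carry out */
    }
    return carry;
}
```

In a loop like this the carry dependency is serial, so it is bounded by ADC latency as well as by throughput.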
Nope. Usually it's the reverse: Atom, for example, can only do one shift per cycle, but it can dual-issue adds.