So SIMD would need to set the overflow flag also to catch em. Which would be muc...

adrian_b · 2024-07-14T08:22:19 1720945339

AVX-512 has evolved from the Larrabee New Instructions (2009), passing through Knights Ferry (2010), Knights Corner (2012) and Knights Landing (2016), to reach Skylake Server (2017), whose set of AVX-512 instructions has remained a subset of the instruction sets of all later CPUs with AVX-512 support.

At each step from Larrabee to Skylake Server, some instructions have been lost, because the initial set of instructions was more complete in order to enable the writing of efficient GPU algorithms, while later Intel believed that for a general-purpose CPU they can reduce the costs by omitting some of those instructions.

(Nevertheless, later they have added many other instructions, some of which may be less useful and more expensive than the original instructions that have been removed.)

Among the original Larrabee instructions that have been deleted, was addition with unsigned overflow (a.k.a. carry), where the output overflow flags were stored in a mask register, enabling their use in a later conditional SIMD instruction.

Signed overflow can be implemented in hardware with negligible additional complexity (a single gate per each result number), so it would have been easy to also add to Larrabee/AVX-512 an addition instruction with signed overflow flags stored in a mask register. Even when only unsigned overflow is available, it is possible to preprocess the operands in such a way that detecting signed overflow would be possible with the unsigned overflow bits, though that requires multiple instructions, slowing down a lot the algorithm.

However in this problem the numbers that are added are non-negative, so the addition with unsigned overflow of the original Larrabee ISA would have been sufficient, had Intel not removed it from AVX-512.

dzaima · 2024-07-14T14:19:48 1720966788

As a minor note, overflow checking for add & sub can be done reasonably-efficiently in software by comparing the wrapping result with a saturating one, good for i8/i16/u8/u16 on x86, should be roughly the same for signed overflow compared to repurposing hardware unsigned overflow (though of course worse than would-be-native for unsigned).

And, regardless, this would be at least one more uop in the core loop (and a somewhat-unpredictable branch at that) which you'd still want to avoid.